Hi Tim,

A link to the TagSoup documentation already exists a few lines further up in the article.
I have slightly changed the text and the example; I hope this makes it less confusing.

Cheers,
Christian

On Thu, Feb 2, 2023 at 6:32 PM Timothée <timog...@gmail.com> wrote:
>
> Hello Christian,
>
> Thanks for your reply. I added the documents again with the default options
> and I do get satisfying results. Not sure why I kept on using the settings
> recommended in the documentation...
> Would it be possible to add the TagSoup documentation link about the parser
> options to the BaseX doc? That could be helpful.
>
> Thanks,
> - Tim
>
> On Mon, Jan 30, 2023 at 10:42 PM Christian Grün <christian.gr...@gmail.com>
> wrote:
>>
>> Hi Tim,
>>
>> I assume the article element will be preserved if you omit the
>> nobogons HTMLPARSER option [1]. Usually, there’s no need to set
>> specific options if the default behavior gives satisfying results.
>>
>> Best,
>> Christian
>>
>> [1] http://vrici.lojban.org/~cowan/tagsoup/
>>
>> On Fri, Jan 27, 2023 at 8:05 PM Timothée <timog...@gmail.com> wrote:
>> >
>> > Hello all,
>> >
>> > I am trying to store HTML documents in BaseX. I set up a local instance of
>> > BaseX on my computer using Docker, and I imported this file in it:
>> > https://pastebin.com/HJdJgLv9
>> >
>> > On my local BaseX instance, the document is imported and
>> > "/html/body/article" does return the <article> node as expected.
>> >
>> > On my remote/production BaseX instance (using the same Dockerfile and
>> > image), the document is imported but the <article> tag is "stripped" (even
>> > though its contents / child nodes remain in the imported document).
>> > "/html/body/article" is empty.
>> >
>> > If I copy over the .basex files from my local database to my remote
>> > database, then the documents are complete like on my local instance. I
>> > also tried to import the documents again on my local instance, and the
>> > <article> tag gets stripped too (and the child nodes remain).
>> >
>> > What am I doing wrong when importing my documents? What did I do to import
>> > them properly in my current local instance? I tried a lot of options but I
>> > just can't figure out why this happens (I fiddled a lot with it).
>> >
>> > I used the following options when importing my documents, as per the
>> > documentation:
>> > SET PARSER html
>> > SET HTMLPARSER method=xml,nons=true,nocdata=true,nodefaults=true,nobogons=true,nocolons=true,ignorable=true
>> > SET CREATEFILTER *.html
>> >
>> > I also use SET FTINDEX true but I don't think it would have an impact
>> > anyway.
>> >
>> > Thank you very much!
>> > - Tim