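Hi Tim,

I assume the article element will be preserved if you omit the nobogons HTMLPARSER option [1]. That option tells TagSoup to suppress elements it does not know, and the HTML5 article element is not part of its schema, so the tag is dropped while its children are kept. Usually, there’s no need to set specific options if the default behavior gives satisfactory results.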
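
As a rough sketch, the import from your mail could look like this with nobogons removed and everything else unchanged. The database name "articles" and the input path are placeholders I made up for illustration, and I have not run this against your file:

```
SET PARSER html
SET HTMLPARSER method=xml,nons=true,nocdata=true,nodefaults=true,nocolons=true,ignorable=true
SET CREATEFILTER *.html
CREATE DB articles /path/to/html
XQUERY /html/body/article
```

If the final query returns the article node, the option was the cause. You could then try dropping the remaining HTMLPARSER options one by one to see whether the defaults are already good enough.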

Best,
Christian

[1] http://vrici.lojban.org/~cowan/tagsoup/

On Fri, Jan 27, 2023 at 8:05 PM Timothée <timog...@gmail.com> wrote:
>
> Hello all,
>
> I am trying to store HTML documents in BaseX. I set up a local instance of
> BaseX on my computer using Docker, and I imported this file in it:
> https://pastebin.com/HJdJgLv9
>
> On my local BaseX instance, the document is imported and "/html/body/article"
> does return the <article> node as expected.
>
> On my remote/production BaseX instance (using the same Dockerfile and image),
> the document is imported but the <article> tag is "stripped" (even though its
> contents / child nodes remain in the imported document). "/html/body/article"
> is empty.
>
> If I copy over the .basex files from my local database to my remote database,
> then the documents are complete like on my local instance. I also tried to
> import the documents again on my local instance, and the <article> tag gets
> stripped too (and the child nodes remain).
>
> What am I doing wrong when importing my documents? What did I do to import
> them properly in my current local instance? I tried a lot of options but I
> just can't figure out why this happens (I fiddled a lot with it).
>
> I used the following options when importing my documents, as per the
> documentation:
> SET PARSER html
> SET HTMLPARSER method=xml,nons=true,nocdata=true,nodefaults=true,nobogons=true,nocolons=true,ignorable=true
> SET CREATEFILTER *.html
>
> I also use SET FTINDEX true but I don't think it would have an impact anyway.
>
> Thank you very much!
> - Tim