Re: [basex-talk] Different results when importing HTML documents

Christian Grün Mon, 30 Jan 2023 22:43:08 -0800

Hi Tim,

I assume the article element will be preserved if you omit the
nobogons HTMLPARSER option [1]. Usually, there’s no need to set
specific options if the default behavior gives satisfactory results.
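
A minimal sketch of such an import, not tested against the setup described
below: TagSoup's nobogons option suppresses elements it does not know, and
<article>, being an HTML5 element, is probably treated as unknown. The
commands repeat the options from the original message with only nobogons
removed. The database name "docs" and the path "/path/to/html" are
placeholders and do not come from the thread.

```
# Assumption: "docs" and "/path/to/html" are placeholder names.
SET PARSER html
SET HTMLPARSER method=xml,nons=true,nocdata=true,nodefaults=true,nocolons=true,ignorable=true
SET CREATEFILTER *.html
CREATE DB docs /path/to/html
XQUERY /html/body/article
```

If the final query returns the <article> node, the nobogons option was the
likely cause of the stripped element.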


Best,
Christian

[1] http://vrici.lojban.org/~cowan/tagsoup/



On Fri, Jan 27, 2023 at 8:05 PM Timothée <timog...@gmail.com> wrote:
>
> Hello all,
>
> I am trying to store HTML documents in BaseX. I set up a local instance of
> BaseX on my computer using Docker, and I imported this file into it:
> https://pastebin.com/HJdJgLv9
>
> On my local BaseX instance, the document is imported and "/html/body/article" 
> does return the <article> node as expected.
>
> On my remote/production BaseX instance (using the same Dockerfile and image), 
> the document is imported but the <article> tag is "stripped" (even though its 
> contents / child nodes remain in the imported document). "/html/body/article" 
> is empty.
>
> If I copy over the .basex files from my local database to my remote database, 
> then the documents are complete like on my local instance. I also tried to 
> import the documents again on my local instance, and the <article> tag gets 
> stripped too (and the child nodes remain).
>
> What am I doing wrong when importing my documents? What did I do to import 
> them properly in my current local instance? I tried a lot of options but I 
> just can't figure out why this happens (I fiddled a lot with it).
>
> I used the following options when importing my documents, as per the 
> documentation:
> SET PARSER html
> SET HTMLPARSER method=xml,nons=true,nocdata=true,nodefaults=true,nobogons=true,nocolons=true,ignorable=true
> SET CREATEFILTER *.html
>
> I also use SET FTINDEX true but I don't think it would have an impact anyway.
>
> Thank you very much!
> - Tim
