Re: [basex-talk] Different results when importing HTML documents

Christian Grün Thu, 02 Feb 2023 10:36:49 -0800

Hi Tim,

A link to the TagSoup documentation already exists some lines further
above in the article.


I have slightly changed the text and the example; I hope this makes it
less confusing.

Cheers,
Christian



On Thu, Feb 2, 2023 at 6:32 PM Timothée <timog...@gmail.com> wrote:
>
> Hello Christian,
>
> Thanks for your reply. I added the documents again with the default options 
> and I do get satisfying results. Not sure why I kept on using the settings 
> recommended in the documentation...
> Would it be possible to add a link to the TagSoup documentation on the parser
> options to the BaseX docs? That could be helpful.
>
> Thanks,
> - Tim
>
> On Mon, Jan 30, 2023 at 10:42 PM Christian Grün <christian.gr...@gmail.com> 
> wrote:
>>
>> Hi Tim,
>>
>> I assume the article element will be preserved if you omit the
>> nobogons HTMLPARSER option [1]. Usually, there’s no need to set
>> specific options if the default behavior gives satisfying results.
>>
>> Best,
>> Christian
>>
>> [1] http://vrici.lojban.org/~cowan/tagsoup/
>>
>>
>>
>> On Fri, Jan 27, 2023 at 8:05 PM Timothée <timog...@gmail.com> wrote:
>> >
>> > Hello all,
>> >
>> > I am trying to store HTML documents in BaseX. I set up a local instance of
>> > BaseX on my computer using Docker, and I imported this file in it: 
>> > https://pastebin.com/HJdJgLv9
>> >
>> > On my local BaseX instance, the document is imported and 
>> > "/html/body/article" does return the <article> node as expected.
>> >
>> > On my remote/production BaseX instance (using the same Dockerfile and 
>> > image), the document is imported but the <article> tag is "stripped" (even 
>> > though its contents / child nodes remain in the imported document). 
>> > "/html/body/article" is empty.
>> >
>> > If I copy over the .basex files from my local database to my remote 
>> > database, then the documents are complete like on my local instance. I 
>> > also tried to import the documents again on my local instance, and the 
>> > <article> tag gets stripped too (and the child nodes remain).
>> >
>> > What am I doing wrong when importing my documents? What did I do to import 
>> > them properly in my current local instance? I tried a lot of options but I 
>> > just can't figure out why this happens (I fiddled a lot with it).
>> >
>> > I used the following options when importing my documents, as per the 
>> > documentation:
>> > SET PARSER html
>> > SET HTMLPARSER method=xml,nons=true,nocdata=true,nodefaults=true,nobogons=true,nocolons=true,ignorable=true
>> > SET CREATEFILTER *.html
>> >
>> > I also use SET FTINDEX true but I don't think it would have an impact 
>> > anyway.
>> >
>> > Thank you very much!
>> > - Tim
