On 2010-07-07 22:32, Ken Krugler wrote:
Hi Julien,

See https://issues.apache.org/jira/browse/TIKA-457 for a description
of one of the cases found by Andrzej. There seems to be something very
wrong with the way <body> is handled; we also saw cases where it
appeared twice in the output.

Don't know about the case of it appearing twice.

But for the above issue, I added a comment. The test HTML is badly
broken, in that you can either have a <body> OR a <frameset>, but not both.

The HTML was broken on purpose: one of the goals of the original test was to extract as much content and as many links as possible in the presence of grave errors. As you know, even major sites often produce badly broken HTML, but the parser should sanitize it and produce a valid DOM. In this case, it produced two nested <body> elements, which is not valid. I should also mention that NekoHTML handled this test much better, by removing the <body> and retaining only the <frameset>.
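
For illustration only, here is a minimal sketch of the kind of check such a test could make on the serialized output. It is not code from Tika or from the original test: the class and function names and both sample strings are made up, and it uses Python's standard html.parser rather than either parser discussed here.

```python
from html.parser import HTMLParser


class BodyNestingChecker(HTMLParser):
    """Counts <body> start tags and tracks how deeply they nest."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # <body> elements currently open
        self.max_depth = 0  # deepest <body> nesting seen so far
        self.count = 0      # total number of <body> start tags

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.count += 1
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "body" and self.depth > 0:
            self.depth -= 1


def body_stats(markup):
    """Return (number of <body> start tags, maximum <body> nesting depth)."""
    checker = BodyNestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.count, checker.max_depth


# Hypothetical parser outputs, invented for this sketch.
nested = "<html><body><body><p>text</p></body></body></html>"
frameset_only = "<html><frameset><frame src='a.html'/></frameset></html>"

print(body_stats(nested))         # nested <body> elements
print(body_stats(frameset_only))  # no <body> at all
```

A nesting depth above 1, or a count above 1, would flag the invalid output described above; output that keeps only the <frameset> reports zero for both.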

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
