On 2010-07-07 22:32, Ken Krugler wrote:
Hi Julien,
See https://issues.apache.org/jira/browse/TIKA-457 for a description
of one of the cases found by Andrzej. There seems to be something very
wrong with the way <body> is handled; we also saw cases where it appeared
twice in the output.
Don't know about the case of it appearing twice.
But for the above issue, I added a comment. The test HTML is badly
broken, in that you can either have a <body> OR a <frameset>, but not both.
The HTML was broken on purpose - one of the goals of the original test
was to extract as much content and as many links as possible in the
presence of grave errors. As you know, even major sites often produce
badly broken HTML, but the parser should sanitize it and produce a valid
DOM. In this case, it produced two nested <body> elements, which is not
valid. I should also mention that NekoHTML handled this test much better,
by removing the <body> and retaining only the <frameset>.
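
For illustration only, here is a rough sketch of the kind of sanity check
one could run over a parser's serialized output to catch the two problems
discussed above: <body> and <frameset> appearing together, and nested
<body> elements. It uses Python's standard html.parser, which only
tokenizes and does no repair of its own. The class name StructureCounter,
the function check_structure, and the sample markup are all hypothetical;
this is not the actual TIKA-457 test case, nor how Tika or NekoHTML work
internally.

```python
from html.parser import HTMLParser


class StructureCounter(HTMLParser):
    """Counts <body>/<frameset> start tags and tracks <body> nesting depth."""

    def __init__(self):
        super().__init__()
        self.counts = {"body": 0, "frameset": 0}
        self.body_depth = 0
        self.max_body_depth = 0

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lower case.
        if tag in self.counts:
            self.counts[tag] += 1
        if tag == "body":
            self.body_depth += 1
            self.max_body_depth = max(self.max_body_depth, self.body_depth)

    def handle_endtag(self, tag):
        if tag == "body" and self.body_depth > 0:
            self.body_depth -= 1


def check_structure(markup):
    """Return a list of structural problems found in the given markup."""
    counter = StructureCounter()
    counter.feed(markup)
    counter.close()
    problems = []
    if counter.counts["body"] and counter.counts["frameset"]:
        problems.append("both <body> and <frameset> present")
    if counter.max_body_depth > 1:
        problems.append("nested <body> elements")
    return problems


# Hypothetical samples: one with nested <body>, one mixing the two elements.
nested = "<html><body><body><p>text</p></body></body></html>"
mixed = "<html><frameset><frame src='a.html'></frameset><body>hi</body></html>"
print(check_structure(nested))
print(check_structure(mixed))
```

A check along these lines would flag the nested-<body> output described
above, and would pass output where only the <frameset> is retained.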
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com