On Jul 8, 2010, at 12:15am, Andrzej Bialecki wrote:
On 2010-07-07 22:32, Ken Krugler wrote:
Hi Julien,
See https://issues.apache.org/jira/browse/TIKA-457 for a description
of one of the cases found by Andrzej. There seems to be something very
wrong with the way <body> is handled; we also saw cases where it
appeared twice in the output.
Don't know about the case of it appearing twice.
But for the above issue, I added a comment. The test HTML is badly
broken, in that you can have either a <body> OR a <frameset>, but
not both.
The HTML was broken on purpose - one of the goals of the original
test was to extract as much content and as many links as possible in
the presence of grave errors. As you know, even major sites often
produce badly broken HTML, yet parsers sanitize it and produce a
valid DOM. In this case, the output contained two nested <body>
elements, which is not valid.
I'll need to check this out - the response from TagSoup was <body/>
followed by the <frameset> data, and finally a closing </html>.
So if Tika is generating two bodies, then that's a bug in Tika. Though
technically, having the <frameset> following the <body> is also invalid.
I'd suggest filing a Tika issue to do a better job of handling invalid
framesets like this. Based on my experience, I don't think there would
be an easy way to get this change into TagSoup.
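For anyone wanting to pin this down in a test, here is a rough sketch of a check for duplicate or nested <body> elements in a parser's output. It is not Tika code: it uses Python's standard xml.sax module purely to illustrate the SAX-style check, and the class name, function name and both sample strings are made up for illustration (the samples are only modelled on the symptoms described above, not captured from Tika or TagSoup).

```python
import xml.sax


class BodyNestingChecker(xml.sax.ContentHandler):
    """Counts <body> elements and notes whether one opens inside another."""

    def __init__(self):
        super().__init__()
        self.body_count = 0    # total number of <body> start tags seen
        self.open_bodies = 0   # <body> elements currently open
        self.nested = False    # True once a <body> opens inside a <body>

    def startElement(self, name, attrs):
        if name.lower() == "body":
            self.body_count += 1
            if self.open_bodies > 0:
                self.nested = True
            self.open_bodies += 1

    def endElement(self, name):
        if name.lower() == "body":
            self.open_bodies -= 1


def check_bodies(xhtml):
    """Run the checker over a well-formed XHTML string and return it."""
    checker = BodyNestingChecker()
    xml.sax.parseString(xhtml.encode("utf-8"), checker)
    return checker


# Hypothetical outputs, assumed for illustration only.
nested_output = (
    "<html><body><body><frameset><frame src='a.html'/></frameset>"
    "</body></body></html>"
)
clean_output = (
    "<html><body/><frameset><frame src='a.html'/></frameset></html>"
)

for label, sample in (("nested", nested_output), ("clean", clean_output)):
    result = check_bodies(sample)
    print(label, result.body_count, result.nested)
```

The same idea should carry over to a Java SAX ContentHandler wrapped around whatever handler the parser is given, so a unit test could fail as soon as a second <body> start event arrives.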
I should also mention that NekoHTML handled this test much better,
by removing the <body> and retaining only the <frameset>.
Yes, that's a well-known issue - certain docs are better handled by
NekoHTML, while with others you get better results from TagSoup.
Anecdotally I'd heard that NekoHTML was better at extracting links.
Tika used to use NekoHTML, but switched to TagSoup last October. One
reason was to avoid a troublesome dependency on Xerces.
-- Ken