On Jul 8, 2010, at 12:15am, Andrzej Bialecki wrote:

On 2010-07-07 22:32, Ken Krugler wrote:
Hi Julien,

See https://issues.apache.org/jira/browse/TIKA-457 for a description
of one of the cases found by Andrzej. There seems to be something very
wrong with the way <body> is handled, we also saw cases where it was
twice in the output.

Don't know about the case of it appearing twice.

But for the above issue, I added a comment. The test HTML is badly
broken, in that you can either have a <body> OR a <frameset>, but not both.

The HTML was broken on purpose - one of the goals of the original test was to extract as much content and as many links as possible in the presence of grave errors. As you know, even major sites often produce badly broken HTML, but parsers sanitize it and produce a valid DOM. In this case, it produced two nested <body> elements, which is not valid.

I'll need to check this out - the response from TagSoup was <body/> followed by the <frameset> data, and finally a closing </html>.

So if Tika is generating two bodies, then that's a bug in Tika. Though technically, having the <frameset> following the <body> is also invalid.
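
For what it's worth, here is a rough sketch of how one might flag the two problems discussed above (a document containing both <body> and <frameset>, and nested <body> elements) in serialized parser output. This is not Tika or TagSoup code; it uses only Python's standard html.parser, and the names BodyFramesetChecker and check are made up for illustration.

```python
from html.parser import HTMLParser


class BodyFramesetChecker(HTMLParser):
    """Counts <body> and <frameset> start tags and tracks <body> nesting.

    Hypothetical helper for illustration only; not part of Tika or TagSoup.
    """

    def __init__(self):
        super().__init__()
        self.body_count = 0
        self.frameset_count = 0
        self.body_depth = 0
        self.max_body_depth = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        if tag == "body":
            self.body_count += 1
            self.body_depth += 1
            self.max_body_depth = max(self.max_body_depth, self.body_depth)
        elif tag == "frameset":
            self.frameset_count += 1

    def handle_endtag(self, tag):
        if tag == "body" and self.body_depth > 0:
            self.body_depth -= 1


def check(markup):
    """Return a list of structural problems found in the markup."""
    parser = BodyFramesetChecker()
    parser.feed(markup)
    parser.close()
    problems = []
    if parser.body_count and parser.frameset_count:
        problems.append("both <body> and <frameset> present")
    if parser.max_body_depth > 1:
        problems.append("nested <body> elements")
    return problems


if __name__ == "__main__":
    sample = "<html><body></body><frameset><frame src='a.html'></frameset></html>"
    print(check(sample))
```

Running something like this over the output of each parser would at least make it easy to tell whether the duplicate <body> comes from the underlying parser or from a later stage.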

I'd suggest filing a Tika issue to do a better job of handling invalid framesets like this. Based on my experience, I don't think there would be an easy way to get this change into TagSoup.

I should also mention that NekoHTML handled this test much better, by removing the <body> and retaining only the <frameset>.

Yes, that's a well-known issue - certain docs are better handled by NekoHTML, while with others you get better results from TagSoup.

Anecdotally I'd heard that NekoHTML was better at extracting links.

Tika used to use NekoHTML, but switched to TagSoup last October. One reason was to avoid a troublesome dependency on Xerces.

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



