Re: Parse-tika ignores too much data...

Ken Krugler Thu, 08 Jul 2010 10:45:12 -0700


On Jul 8, 2010, at 12:15am, Andrzej Bialecki wrote:

On 2010-07-07 22:32, Ken Krugler wrote:
Hi Julien,
See https://issues.apache.org/jira/browse/TIKA-457 for a description
of one of the cases found by Andrzej. There seems to be something very
wrong with the way <body> is handled; we also saw cases where it was
twice in the output.
Don't know about the case of it appearing twice.

But for the above issue, I added a comment. The test HTML is badly
broken, in that you can either have a <body> OR a <frameset>, but not both.
The HTML was broken on purpose - one of the goals of the original test was to extract as much content and as many links as possible in the presence of grave errors - as you know, even major sites often produce badly broken HTML, but the parser should sanitize it and produce a valid DOM. In this case, it produced two nested <body> elements, which is not valid.

I'll need to check this out - the response from TagSoup was <body/> followed by the <frameset> data, and finally a closing </html>.

So if Tika is generating two bodies, then that's a bug in Tika. Though technically, having the <frameset> following the <body> is also invalid.

I'd suggest filing a Tika issue to do a better job of handling invalid framesets like this. Based on my experience, I don't think there would be an easy way to get this change into TagSoup.
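
For anyone who wants to check serialized parser output for the two problems discussed here (more than one <body>, or a <body> together with a <frameset>), a rough sketch follows. It uses only Python's standard html.parser and is not part of Tika, TagSoup or NekoHTML; the names StructureCounter and structure_problems are made up for illustration.

```python
from html.parser import HTMLParser


class StructureCounter(HTMLParser):
    """Counts <body> and <frameset> start tags in a serialized document."""

    def __init__(self):
        super().__init__()
        self.counts = {"body": 0, "frameset": 0}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names; self-closing tags such as
        # <body/> also reach this method via the default handle_startendtag.
        if tag in self.counts:
            self.counts[tag] += 1


def structure_problems(markup):
    """Return a list of structural problems found in the markup (hypothetical helper)."""
    counter = StructureCounter()
    counter.feed(markup)
    counter.close()
    problems = []
    if counter.counts["body"] > 1:
        problems.append("multiple <body> elements")
    if counter.counts["body"] and counter.counts["frameset"]:
        problems.append("both <body> and <frameset> present")
    return problems
```

Feeding it the output of each parser for the same broken test page would show which of them leaves a stray <body> next to the <frameset>.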

I should also mention that NekoHTML handled this test much better, by removing the <body> and retaining only the <frameset>.

Yes, that's a well-known issue - certain docs are better handled by NekoHTML, while with others you get better results from TagSoup.


Anecdotally I'd heard that NekoHTML was better at extracting links.

Tika used to use NekoHTML, but switched to TagSoup last October. One reason was to avoid a troublesome dependency on Xerces.


-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
