Hi Julien,

See https://issues.apache.org/jira/browse/TIKA-457 for a description of one of the cases found by Andrzej. There seems to be something very wrong with the way <body> is handled; we also saw cases where it appeared twice in the output.

I don't know about the case of it appearing twice.

But for the above issue, I added a comment. The test HTML is badly broken, in that you can either have a <body> OR a <frameset>, but not both.

-- Ken

On 7 July 2010 17:41, Ken Krugler <[email protected]> wrote:
Hi Andrzej,

I've got an old list of cases where Tika was not extracting links:

 - frame
 - iframe
 - img
 - map
 - object
 - link (only in <head> section)

I worked around this in my crawling code, by directly processing the DOM, but I should roll this into Tika.
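
A minimal sketch of such a DOM walk, using only the JDK's org.w3c.dom API, might look like the following. It is not the actual crawler code: the class name DomLinkExtractor, the parse helper, and the element-to-attribute table are assumptions made for illustration, and `area` is included on the assumption that the links inside a `map` live on its `area` children.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: collects link URLs from the elements listed above.
class DomLinkExtractor {

    // Element name -> attribute assumed to carry the URL (illustrative table).
    private static final Map<String, String> LINK_ATTRS = Map.of(
        "frame", "src",
        "iframe", "src",
        "img", "src",
        "area", "href",    // children of <map>
        "object", "data",
        "link", "href");

    // Returns the link attribute values found under the given node, in document order.
    static List<String> extractLinks(Node root) {
        List<String> links = new ArrayList<>();
        collect(root, links);
        return links;
    }

    // Depth-first walk over the DOM tree.
    private static void collect(Node node, List<String> links) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            String attr = LINK_ATTRS.get(el.getTagName().toLowerCase(Locale.ROOT));
            if (attr != null && el.hasAttribute(attr)) {
                links.add(el.getAttribute(attr));
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, links);
        }
    }

    // Test helper: builds a DOM from well-formed XHTML. A real crawler would
    // get its DOM from an HTML parser that tolerates broken markup.
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><head><link href='style.css'/></head>"
            + "<body><img src='a.png'/><map><area href='b.html'/></map>"
            + "<iframe src='c.html'/></body></html>");
        System.out.println(extractLinks(doc));
    }
}
```

The sketch returns the raw attribute values; resolving them against the page's base URL is left out.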

If you have a list of problems with test docs, file a TIKA issue and I'll try to fix things up quickly.

Thanks,

-- Ken


On Jul 7, 2010, at 5:55am, Andrzej Bialecki wrote:

Hi,

I'm going through NUTCH-840, and I tried to eat our own dog food, i.e. prepare the test DOMs with Tika's HtmlParser.

The results are not so good for some test cases... Even when using IdentityHtmlMapper, Tika ignores some elements (such as frame/frameset), and for some others (area) it drops the href. As a result, the number of valid outlinks collected with parse-tika is much smaller than with parse-html.

I know this issue has been reported (TIKA-379, NUTCH-817, NUTCH-794), and a partial fix was applied to Tika 0.8, but still this won't handle the problems I mentioned above.

Can we come up with a plan to address this? I'd rather switch completely to Tika's HTML parsing, but at the moment we would lose too much useful data...

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







--
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
