Hi Julien,
See https://issues.apache.org/jira/browse/TIKA-457 for a description
of one of the cases found by Andrzej. There seems to be something
very wrong with the way <body> is handled; we also saw cases where it
appeared twice in the output.
I don't know about the case of it appearing twice.
But for the above issue, I added a comment. The test HTML is badly
broken, in that a document can have either a <body> or a <frameset>,
but not both.
-- Ken
On 7 July 2010 17:41, Ken Krugler wrote:
Hi Andrzej,
I've got an old list of cases where Tika was not extracting links:
- frame
- iframe
- img
- map
- object
- link (only in <head> section)
I worked around this in my crawling code, by directly processing the
DOM, but I should roll this into Tika.
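
As a rough illustration of the "directly processing the DOM" workaround, here is a minimal sketch that walks a W3C DOM and collects link targets from the elements listed above. It is not the actual crawler code and not a Tika or Nutch API: the class DomLinkSketch, its methods, and the element-to-attribute table are all hypothetical and only assume the usual HTML attributes (src for frame/iframe/img, href for link and for the area children of map, data for object). The parseXhtml helper uses the JDK XML parser, so it only accepts well-formed XHTML; real pages would need a tolerant HTML parser in front of it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Tika or Nutch.
class DomLinkSketch {

    // Assumed mapping: element name -> attribute that carries the link target.
    private static final Map<String, String> LINK_ATTRS = Map.of(
        "a", "href",
        "area", "href",    // <area> children of <map>
        "frame", "src",
        "iframe", "src",
        "img", "src",
        "link", "href",
        "object", "data");

    // Returns link targets in document order (depth-first walk).
    static List<String> collectLinks(Node root) {
        List<String> links = new ArrayList<>();
        walk(root, links);
        return links;
    }

    private static void walk(Node node, List<String> links) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            String attr = LINK_ATTRS.get(el.getTagName().toLowerCase(Locale.ROOT));
            if (attr != null && el.hasAttribute(attr)) {
                String target = el.getAttribute(attr).trim();
                if (!target.isEmpty()) {
                    links.add(target);
                }
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, links);
        }
    }

    // Parses well-formed XHTML only; wraps checked exceptions for brevity.
    static Document parseXhtml(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XHTML", e);
        }
    }
}
```

Calling DomLinkSketch.collectLinks(DomLinkSketch.parseXhtml(page)) returns the raw attribute values; resolving them against the page's base URL is left out of the sketch.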
If you have a list of problems with test docs, file a TIKA issue and
I'll try to fix things up quickly.
Thanks,
-- Ken
On Jul 7, 2010, at 5:55am, Andrzej Bialecki wrote:
Hi,
I'm going through NUTCH-840, and I tried to eat our own dog food,
i.e. prepare the test DOMs with Tika's HtmlParser.
Results are not so good for some test cases... Even when using
IdentityHtmlMapper Tika ignores some elements (such as frame/
frameset) and for some others (area) it drops the href. As a result,
the number of valid outlinks collected with parse-tika is much
smaller than with parse-html.
I know this issue has been reported (TIKA-379, NUTCH-817,
NUTCH-794), and a partial fix was applied to Tika 0.8, but still
this won't handle the problems I mentioned above.
Can we come up with a plan to address this? I'd rather switch
completely to Tika's HTML parsing, but at the moment we would lose
too much useful data...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com