Hi,

I'm going through NUTCH-840, and I tried to eat our own dog food, i.e. prepare the test DOMs with Tika's HtmlParser.

The results are not so good for some test cases... Even when using IdentityHtmlMapper, Tika ignores some elements (such as frame/frameset), and for some others (area) it drops the href attribute. As a result, the number of valid outlinks collected with parse-tika is much smaller than with parse-html.

I know this issue has been reported (TIKA-379, NUTCH-817, NUTCH-794), and a partial fix was applied in Tika 0.8, but that fix still doesn't handle the problems I mentioned above.

Can we come up with a plan to address this? I'd rather switch completely to Tika's HTML parsing, but at the moment we would lose too much useful data...

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com