Hi,
I'm going through NUTCH-840, and I tried to eat our own dog food, i.e.
prepare the test DOMs with Tika's HtmlParser.
Results are not so good for some test cases... Even when using
IdentityHtmlMapper, Tika ignores some elements (such as frame/frameset)
and for some others (area) it drops the href. As a result, the number of
valid outlinks collected with parse-tika is much smaller than with
parse-html.
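
To make the loss concrete, here is a minimal sketch of the effect. It is not
Nutch or Tika code: it only uses the JDK's XML parser on two tiny well-formed
snippets, and the element-to-attribute table (a/area -> href, frame/iframe ->
src) as well as the names OutlinkSketch, LINK_ATTRS, collect and parse are
assumptions made up for illustration. The second snippet mimics a DOM where
the frameset is gone and the area has lost its href.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class OutlinkSketch {
    // Hypothetical table: which attribute carries the link for which element.
    static final Map<String, String> LINK_ATTRS = new HashMap<>();
    static {
        LINK_ATTRS.put("a", "href");
        LINK_ATTRS.put("area", "href");
        LINK_ATTRS.put("frame", "src");
        LINK_ATTRS.put("iframe", "src");
    }

    // Depth-first walk that appends every link target found, in document order.
    static List<String> collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            String attr = LINK_ATTRS.get(el.getTagName().toLowerCase(Locale.ROOT));
            if (attr != null && el.hasAttribute(attr)) {
                out.add(el.getAttribute(attr));
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
        return out;
    }

    // Parses well-formed markup only; real-world HTML needs a tolerant parser.
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        // DOM as the page author wrote it.
        String full = "<html><frameset><frame src='menu.html'/></frameset>"
                + "<body><a href='a.html'>a</a><map><area href='b.html'/></map></body></html>";
        // Same page after frame/frameset were dropped and area lost its href.
        String lossy = "<html><body><a href='a.html'>a</a><map><area/></map></body></html>";

        System.out.println(collect(parse(full), new ArrayList<>()).size() + " outlinks (full DOM)");
        System.out.println(collect(parse(lossy), new ArrayList<>()).size() + " outlinks (lossy DOM)");
    }
}
```

Whatever link extractor runs afterwards, it can only see what the parser left
in the DOM, so the missing elements and attributes translate directly into
missing outlinks.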
I know this issue has been reported (TIKA-379, NUTCH-817, NUTCH-794),
and a partial fix was applied to Tika 0.8, but it still doesn't handle
the problems I mentioned above.
Can we come up with a plan to address this? I'd rather switch completely
to Tika's HTML parsing, but at the moment we would lose too much useful
data...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com