[ https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671148#comment-16671148 ]
Dave Meikle commented on TIKA-2760:
-----------------------------------

Hi [~markus17],

I used your test but moved it into the tika-parsers project so that the HtmlParser is registered. In tika-core only the MockParser is available, so there I get the same results as you.

Here's a diff based on your patch: [^TIKA-2760 - Test for Outlinks.diff]

I've just forked Nutch and will have a wee look in the parse-tika and parse-html modules.

Cheers,
Dave

> LinkContentHandler does not report hyperlinks
> ---------------------------------------------
>
>                 Key: TIKA-2760
>                 URL: https://issues.apache.org/jira/browse/TIKA-2760
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.19
>            Reporter: Markus Jelsma
>            Priority: Major
>             Fix For: 1.20
>
>         Attachments: TIKA-2760 - Test for Outlinks.diff, TIKA-2760.patch,
> ronaldmcdonald-nolinks.html
>
>
> Nutch uses LinkContentHandler for collecting hyperlinks, and it does not report
> any hyperlink for http://www.ronaldmcdonaldhouse.co.uk/, which I'll also
> attach to this ticket.
> Debugging LinkContentHandler to print element names in startElement reveals
> that only very few HTML elements get reported, which I think is incorrect.
> Our own parser in Nutch uses a custom ContentHandler and does report many
> elements, including hyperlinks.
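
For readers who want to reproduce the kind of startElement tracing described in the issue, here is a minimal sketch. It is not Tika's LinkContentHandler and does not use Tika's HtmlParser: it assumes a well-formed XHTML string and the JDK's built-in SAX parser, and the names StartElementTrace, TracingLinkHandler and trace are hypothetical, chosen only for illustration. It records every element name passed to startElement and the href of each a element, so you can see which elements a parser actually reports.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class StartElementTrace {

    /** Hypothetical handler: records each element name and the href of every <a>. */
    static class TracingLinkHandler extends DefaultHandler {
        final List<String> elements = new ArrayList<>();
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // A parser that is not namespace-aware may leave localName empty,
            // so fall back to the qualified name.
            String name = (localName == null || localName.isEmpty()) ? qName : localName;
            elements.add(name);
            if ("a".equalsIgnoreCase(name)) {
                String href = atts.getValue("href");
                if (href != null) {
                    links.add(href);
                }
            }
        }
    }

    /** Parses well-formed XHTML with the JDK SAX parser (an assumption; Tika is not involved). */
    static TracingLinkHandler trace(String xhtml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TracingLinkHandler handler = new TracingLinkHandler();
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><p><a href=\"http://example.com/a\">A</a></p>"
                + "<div><a href=\"/b\">B</a></div></body></html>";
        TracingLinkHandler h = trace(page);
        System.out.println("elements: " + h.elements);
        System.out.println("links: " + h.links);
    }
}
```

If the elements list is much shorter than the markup suggests, the parser in front of the handler is filtering or failing to emit events, which matches the symptom reported above.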