[ https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655169#comment-16655169 ]
Markus Jelsma commented on TIKA-2760:
-------------------------------------

The patch file only contains a unit test. The expected value in the test is not correct anyway, because I haven't counted the number of hyperlinks we should expect. The output of getLinks().size() is 0.

> LinkContentHandler does not report hyperlinks
> ---------------------------------------------
>
>                 Key: TIKA-2760
>                 URL: https://issues.apache.org/jira/browse/TIKA-2760
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.19
>            Reporter: Markus Jelsma
>            Priority: Major
>             Fix For: 1.20
>
>         Attachments: TIKA-2760.patch, ronaldmcdonald-nolinks.html
>
>
> Nutch uses LinkContentHandler for collecting hyperlinks, and it does not report
> any hyperlinks for http://www.ronaldmcdonaldhouse.co.uk/, which I'll also
> attach to this ticket.
> Debugging LinkContentHandler by printing element names in startElement reveals
> that only very few HTML elements get reported, which I think is incorrect.
> Our own parser in Nutch uses a custom ContentHandler and does report many
> elements, including hyperlinks.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
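
For readers who want to repeat the debugging step described in the issue (recording element names in startElement and counting the links seen), here is a minimal sketch. It is not Tika's LinkContentHandler and not the attached patch: ElementTraceHandler and its trace method are hypothetical names, and it assumes well-formed XHTML fed through the JDK's own SAX parser rather than Tika's HTML parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for illustration only; not Tika's LinkContentHandler.
class ElementTraceHandler extends DefaultHandler {
    // Every element name passed to startElement, in document order.
    final List<String> elements = new ArrayList<>();
    // The href value of every <a> element that has one.
    final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Fall back to qName in case the parser does not supply a local name.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        elements.add(name);
        if ("a".equalsIgnoreCase(name)) {
            String href = atts.getValue("href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    // Parses a well-formed XHTML string and returns the populated handler.
    static ElementTraceHandler trace(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        ElementTraceHandler handler = new ElementTraceHandler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p><a href=\"/one\">one</a></p>"
                + "<a href=\"/two\">two</a></body></html>";
        ElementTraceHandler h = trace(xhtml);
        System.out.println("elements: " + h.elements);
        System.out.println("links: " + h.links.size());
    }
}
```

If a handler wired up this way sees the full element stream while another handler on the same input sees only a few elements, the difference most likely lies in what the upstream parser forwards rather than in the link-collecting logic itself.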