[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15108025#comment-15108025 ]
Otis Gospodnetic commented on NUTCH-1233:
-----------------------------------------
My opinion: better to have this in Nutch (the issue is 4 years old) and then
work on improving Tika's link extraction... oh, which I see you already did in
TIKA-1835, so it's a matter of getting TIKA-1835 committed and then upgrading
Tika in Nutch. +1 for committing, if you ask me.
> Rely on Tika for outlink extraction
> -----------------------------------
>
> Key: NUTCH-1233
> URL: https://issues.apache.org/jira/browse/NUTCH-1233
> Project: Nutch
> Issue Type: Improvement
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Attachments: NUTCH-1233-1.5-wip.patch, NUTCH-1233-1.6-1.patch,
> NUTCH-1233-1.6-2.patch, NUTCH-1233.patch, NUTCH-1233.patch, post-1233-2.txt,
> post-1233.txt, pre-1233-2.txt, pre-1233.txt
>
>
> Tika provides outlink extraction features that are not used in Nutch. To be
> able to use them in Nutch, we need Tika to return the rel attribute value of
> each link, which it currently doesn't. There's a patch for Tika 1.1. If that
> patch is included in Tika and we upgrade to that new version, this issue can
> be worked on. Here's preliminary code that does both Tika and current outlink
> extraction. This also includes parts of the Boilerpipe code.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)