[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433293#comment-13433293 ]
Markus Jelsma commented on NUTCH-1233:
--------------------------------------
Hi Ken,
The rel attribute is extracted in Tika 1.1, so that's fine. I'd prefer the
whitespace to be collapsed in Tika instead, but it was quicker for me to fix it
in Nutch. It also retrieves more URLs in some cases, which is usually a good
thing.
It seems odd indeed to not want it collapsed by Tika, but perhaps some funky
academic research would like to measure strange HTML on the web using Tika's
outlink extractor. It could be made configurable in Tika anyway.
Perhaps we should deliver a patch for Tika.
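To make the idea concrete, here is a minimal sketch of a configurable whitespace collapse for extracted href values. It is not the code from the attached patches; the class name HrefWhitespace, the method normalize and the boolean switch are hypothetical, and "collapse" is assumed to mean replacing each run of whitespace with a single space and trimming the ends.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Nutch or Tika.
final class HrefWhitespace {

    // Matches one or more whitespace characters (space, tab, CR, LF, form feed, ...).
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    private HrefWhitespace() {
        // Static utility class, no instances.
    }

    /**
     * Returns the href with every run of whitespace replaced by a single
     * space and leading/trailing whitespace removed. When collapse is false
     * the href is returned untouched, which would keep the raw value
     * available for anyone who wants to study odd HTML.
     */
    static String normalize(String href, boolean collapse) {
        if (href == null || !collapse) {
            return href;
        }
        return WHITESPACE_RUN.matcher(href).replaceAll(" ").trim();
    }
}
```

A parser could call HrefWhitespace.normalize(rawHref, true) on each extracted link before building the outlink, with the flag driven by a configuration property if this were made optional.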
> Rely on Tika for outlink extraction
> -----------------------------------
>
> Key: NUTCH-1233
> URL: https://issues.apache.org/jira/browse/NUTCH-1233
> Project: Nutch
> Issue Type: Improvement
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1233-1.5-wip.patch, NUTCH-1233-1.6-1.patch,
> NUTCH-1233-1.6-2.patch
>
>
> Tika provides outlink extraction features that are not used in Nutch. To be
> able to use it in Nutch we need Tika to return the rel attr value of each
> link, which it currently doesn't. There's a patch for Tika 1.1. If that patch
> is included in Tika and we upgrade to that new version, this issue can be
> worked on. Here's preliminary code that does both Tika and current outlink
> extraction. This also includes parts of the Boilerpipe code.