[
https://issues.apache.org/jira/browse/NUTCH-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575596#comment-13575596
]
Lewis John McGibbney commented on NUTCH-1525:
---------------------------------------------
This issue is not about redirection. My thought was that if
db.ignore.external.links is set to true, the URLs are discarded and not
used at all within the webdb. I want to retain those URLs (even if they are
not being used), as they may come in handy later on if, for example, I wish to
crawl beyond some domain.
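
To illustrate the idea, here is a minimal sketch in plain Java. It is not Nutch code and not a proposed patch. The class name ExternalLinkRecorder and its methods are hypothetical, and it assumes that "external" simply means the outlink's host differs from the host of the page it was found on. Rather than dropping such links, it keeps them in a separate list.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration only: splits outlinks by host instead of dropping external ones. */
final class ExternalLinkRecorder {

    /** Outlinks on the same host as the page they were found on. */
    final List<String> internal = new ArrayList<>();
    /** Outlinks on a different host, retained for possible later crawling. */
    final List<String> external = new ArrayList<>();

    /** Returns the lower-cased host of a URL, or null if it cannot be determined. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null; // malformed URL
        }
    }

    /** Classifies one outlink relative to the page it was found on. */
    void record(String fromUrl, String toUrl) {
        String fromHost = hostOf(fromUrl);
        String toHost = hostOf(toUrl);
        if (fromHost == null || toHost == null) {
            return; // skip links whose host cannot be parsed (e.g. mailto:)
        }
        if (fromHost.equals(toHost)) {
            internal.add(toUrl);
        } else {
            external.add(toUrl); // kept aside rather than discarded
        }
    }

    public static void main(String[] args) {
        ExternalLinkRecorder recorder = new ExternalLinkRecorder();
        recorder.record("http://example.org/a", "http://example.org/b");
        recorder.record("http://example.org/a", "http://other.example.com/c");
        System.out.println("internal: " + recorder.internal);
        System.out.println("external: " + recorder.external);
    }
}
```

In a real crawl the external list would need to go somewhere more durable than memory, which is the storage question this issue raises.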
> Generator to record external links even when db.ignore.external.links set to
> true
> ----------------------------------------------------------------------------------
>
> Key: NUTCH-1525
> URL: https://issues.apache.org/jira/browse/NUTCH-1525
> Project: Nutch
> Issue Type: Improvement
> Components: generator
> Reporter: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.7, 2.2
>
>
> When fetching pages from specific domains we have various options, e.g. use
> urlfilters, or set the above property to true before injecting URLs into the
> webdb. However, with the former it is recognised that complex regexes can
> slow down processing, and with the latter we disregard a number of
> URLs which could potentially become useful in the future.
> Unfortunately there is no way to record external links encountered for future
> processing, although the wiki suggests that a very small patch to the
> generator code can allow you to log these links to hadoop.log. Although this
> is better, a more robust storage mechanism would be preferred. This may tie
> in with custom counters we've already specified or may require new counters
> to be implemented.