[ 
https://issues.apache.org/jira/browse/NUTCH-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13568433#comment-13568433
 ] 

Lewis John McGibbney commented on NUTCH-1525:
---------------------------------------------

Hi Lufeng.
Yes, I think you are right. I would like to make sure, though, so please comment.
The URL is already stored in the WebDB before we check for the redirect in 
FetcherReducer#handleRedirect() (in 2.x):

{code} 
      if (ignoreExternalLinks) {
        String toHost   = new URL(newUrl).getHost().toLowerCase();
        String fromHost = new URL(url).getHost().toLowerCase();
        if (toHost == null || !toHost.equals(fromHost)) {
          // external links
          return;
        }
      }
{code}
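
For discussion, here is a minimal standalone sketch of the same host comparison. It is not the Nutch code: the names ExternalRedirectCheck, hostOf and isSameHost are hypothetical, and treating a malformed URL as external is an assumption. In the snippet above, toLowerCase() runs before the null check, so that check can never take effect. The sketch guards the host first.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch.
class ExternalRedirectCheck {

    /** Returns the lower-cased host of a URL, or "" when the URL has no host. */
    static String hostOf(String url) throws MalformedURLException {
        String host = new URL(url).getHost();
        // Guard before lower-casing so a missing host cannot cause an NPE.
        return host == null ? "" : host.toLowerCase(Locale.ROOT);
    }

    /** Returns true when the redirect target stays on the same host as the source. */
    static boolean isSameHost(String fromUrl, String toUrl) {
        try {
            String fromHost = hostOf(fromUrl);
            String toHost = hostOf(toUrl);
            return !toHost.isEmpty() && toHost.equals(fromHost);
        } catch (MalformedURLException e) {
            // Assumption: a URL we cannot parse is treated as external.
            return false;
        }
    }

    public static void main(String[] args) {
        // Same host, different case: internal redirect.
        System.out.println(isSameHost("http://nutch.apache.org/a", "http://NUTCH.apache.org/b"));
        // Different host: external redirect.
        System.out.println(isSameHost("http://nutch.apache.org/a", "http://example.com/b"));
    }
}
```

A caller could record the target (for example in a counter or a log) wherever isSameHost returns false, instead of returning silently.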
                
> Generator to record external links even when db.ignore.external.links set to 
> true
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-1525
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1525
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.7, 2.2
>
>
> When fetching pages from specific domains we have various options e.g. use 
> urlfilters, set the above property to true before injecting urls into the 
> webdb etc. However with the former, it is recognised that complex regex can 
> slow down processing and with the latter it means we disregard a number of 
> urls which could potentially become useful in the future.
> Unfortunately there is no way to record external links encountered for future 
> processing, although the wiki suggests that a very small patch to the 
> generator code can allow you to log these links to hadoop.log. Although this 
> is better, a more robust storage mechanism would be preferred. This may tie 
> in with custom counters we've already specified or may require new counters 
> to be implemented.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
