[jira] [Updated] (NUTCH-1525) Generator to record external links even when db.ignore.external.links set to true

Lewis John McGibbney (JIRA) Sun, 27 Jan 2013 10:45:13 -0800

     [ 
https://issues.apache.org/jira/browse/NUTCH-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Lewis John McGibbney updated NUTCH-1525:
----------------------------------------

    Description: 
When fetching pages from specific domains we have various options, e.g. use 
urlfilters, or set db.ignore.external.links to true before injecting urls into 
the webdb. However, with the former it is recognised that complex regex can 
slow down processing, and with the latter we disregard a number of urls which 
could potentially become useful in the future.
Unfortunately there is no way to record external links encountered for future 
processing, although the wiki suggests that a very small patch to the generator 
code can allow you to log these links to hadoop.log. Although this is better, a 
more robust storage mechanism would be preferred. This may tie in with custom 
counters we've already specified or may require new counters to be implemented.
 

  was:
When fetching pages from specific domains we have various options e.g. use 
urlfilters, set the above property to true before injecting urls into the webdb 
etc. However with the former, it is recognised that complex regex can slow down 
processing and with the latter it means we disregard a number of urls which 
could potentially become useful in the future.
Unfortunately there is no way to record external links encountered for future 
processing, althoughthe  wiki a very small patch to the generator code can 
allow you to log these links to hadoop.log.

    
> Generator to record external links even when db.ignore.external.links set to 
> true
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-1525
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1525
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.7, 2.2
>
>
> When fetching pages from specific domains we have various options, e.g. use 
> urlfilters, or set db.ignore.external.links to true before injecting urls 
> into the webdb. However, with the former it is recognised that complex regex 
> can slow down processing, and with the latter we disregard a number of urls 
> which could potentially become useful in the future.
> Unfortunately there is no way to record external links encountered for future 
> processing, although the wiki suggests that a very small patch to the 
> generator code can allow you to log these links to hadoop.log. Although this 
> is better, a more robust storage mechanism would be preferred. This may tie 
> in with custom counters we've already specified or may require new counters 
> to be implemented.
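
As a rough illustration of the idea only (this is not the wiki patch and not
existing Nutch code), the sketch below sets aside links whose host differs
from the source page's host instead of silently dropping them. The class
name ExternalLinkRecorder, its methods, and the in-memory list are all
hypothetical; it also assumes "external" simply means "different host".
A real change would presumably write to something more durable than a list.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of Nutch: records outlinks that point to another host. */
class ExternalLinkRecorder {

    // In-memory store used only for this sketch; an assumption, not the proposed mechanism.
    private final List<String> external = new ArrayList<>();

    /** Returns the lower-cased host of a URL, or null if it cannot be parsed or has no host. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    /**
     * Returns true when toUrl is on the same host as fromUrl (the link is kept).
     * Otherwise the link is recorded for later processing and false is returned.
     */
    boolean accept(String fromUrl, String toUrl) {
        String from = hostOf(fromUrl);
        String to = hostOf(toUrl);
        if (from != null && from.equals(to)) {
            return true;
        }
        external.add(toUrl);
        return false;
    }

    /** Read-only view of the external links seen so far. */
    List<String> recorded() {
        return Collections.unmodifiableList(external);
    }
}
```

A caller would invoke accept(fromUrl, toUrl) for each outlink and later read
recorded() to decide which external urls are worth injecting.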
