[ 
https://issues.apache.org/jira/browse/NUTCH-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Cherniachenko updated NUTCH-1525:
----------------------------------------

    Attachment: nutch-logExternal.patch

Attached the patch for Nutch 1.7.

With it applied, you can add the following to log4j.properties:
{code}
log4j.logger.org.apache.nutch.parse.ParseOutputFormat.externalLinks=INFO,extlinks

log4j.appender.extlinks=org.apache.log4j.DailyRollingFileAppender
log4j.appender.extlinks.File=${hadoop.log.dir}/external-links.log
log4j.appender.extlinks.DatePattern=.yyyy-MM-dd
log4j.appender.extlinks.layout=org.apache.log4j.PatternLayout
log4j.appender.extlinks.layout.ConversionPattern=%m%n
{code}

All ignored external links will then be logged cleanly to
external-links.log.
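
For readers without the patch at hand, the idea can be sketched roughly as follows. This is not the code from nutch-logExternal.patch: the class name ExternalLinkSketch and the method keepInternal are hypothetical, the "same host" rule is an assumption about how db.ignore.external.links decides what is external, and java.util.logging stands in for log4j so that the example runs without extra jars. Only the logger name is taken from the configuration above.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.logging.Logger;

// Hypothetical illustration only; this is not the code from nutch-logExternal.patch.
public class ExternalLinkSketch {

    // Reuses the logger name from the log4j.properties snippet. java.util.logging
    // stands in for log4j here so the sketch has no external dependencies.
    private static final Logger EXT_LINKS =
        Logger.getLogger("org.apache.nutch.parse.ParseOutputFormat.externalLinks");

    // Returns the lower-cased host of a URL, or null if it cannot be parsed.
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Keeps outlinks on the same host as the source page (assumed rule) and
    // writes every other outlink to the dedicated logger, one URL per record.
    // Links that cannot be parsed are treated as external in this sketch.
    static List<String> keepInternal(String fromUrl, List<String> outlinks) {
        String fromHost = hostOf(fromUrl);
        List<String> kept = new ArrayList<>();
        for (String target : outlinks) {
            if (fromHost != null && fromHost.equals(hostOf(target))) {
                kept.add(target);
            } else {
                EXT_LINKS.info(target);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> kept = keepInternal(
            "http://example.org/index.html",
            Arrays.asList(
                "http://example.org/about.html",
                "http://other.example.com/page.html"));
        System.out.println(kept);
    }
}
```

In log4j 1.x, appender additivity is on by default, so the same lines may also reach the root logger's appenders (for example hadoop.log). If that is not wanted, adding log4j.additivity.org.apache.nutch.parse.ParseOutputFormat.externalLinks=false to log4j.properties should keep them in external-links.log only.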

> Generator to record external links even when db.ignore.external.links set to
> true
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-1525
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1525
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: nutch-logExternal.patch
>
>
> When fetching pages from specific domains we have various options, e.g. use
> urlfilters, set the above property to true before injecting urls into the
> webdb, etc. However, with the former, it is recognised that complex regex can
> slow down processing, and with the latter we disregard a number of
> urls which could potentially become useful in the future.
> Unfortunately there is no way to record external links encountered for future
> processing, although the wiki suggests that a very small patch to the
> generator code can allow you to log these links to hadoop.log. Although this
> is better, a more robust storage mechanism would be preferred. This may tie
> in with custom counters we've already specified or may require new counters
> to be implemented.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
