[ 
https://issues.apache.org/jira/browse/NUTCH-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686424#comment-16686424
 ] 

ASF GitHub Bot commented on NUTCH-2630:
---------------------------------------

sebastian-nagel closed pull request #387: NUTCH-2630 Fetcher to log skipped records by robots.txt
URL: https://github.com/apache/nutch/pull/387

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/src/java/org/apache/nutch/fetcher/FetcherThread.java b/src/java/org/apache/nutch/fetcher/FetcherThread.java
index bfcc3741e..6ba920e87 100644
--- a/src/java/org/apache/nutch/fetcher/FetcherThread.java
+++ b/src/java/org/apache/nutch/fetcher/FetcherThread.java
@@ -302,9 +302,7 @@ public void run() {
             if (!rules.isAllowed(fit.url.toString())) {
               // unblock
               ((FetchItemQueues) fetchQueues).finishFetchItem(fit, true);
-              if (LOG.isDebugEnabled()) {
-                LOG.debug("Denied by robots.txt: {}", fit.url);
-              }
+              LOG.info("Denied by robots.txt: {}", fit.url);
               output(fit.url, fit.datum, null,
                   ProtocolStatus.STATUS_ROBOTS_DENIED,
                   CrawlDatum.STATUS_FETCH_GONE);
@@ -315,7 +313,7 @@ public void run() {
               if (rules.getCrawlDelay() > maxCrawlDelay && maxCrawlDelay >= 0) {
                 // unblock
                 ((FetchItemQueues) fetchQueues).finishFetchItem(fit, true);
-                LOG.debug("Crawl-Delay for {} too long ({}), skipping", fit.url,
+                LOG.info("Crawl-Delay for {} too long ({}), skipping", fit.url,
                     rules.getCrawlDelay());
                 output(fit.url, fit.datum, null,
                     ProtocolStatus.STATUS_ROBOTS_DENIED,
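
The practical effect of moving these two messages from debug to info can be sketched as follows. This is only an illustration, assuming a logger configured at INFO level (the actual Nutch logging configuration is not part of this diff). It uses java.util.logging instead of the `{}`-placeholder logging API seen in the diff, so Level.FINE stands in for debug. The class RobotsDeniedLoggingSketch, the CapturingHandler, the logDenied method, and the example URL are hypothetical names introduced for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class RobotsDeniedLoggingSketch {

  /** Hypothetical handler that keeps every message it receives in memory. */
  static final class CapturingHandler extends Handler {
    final List<String> messages = new ArrayList<>();

    @Override
    public void publish(LogRecord record) {
      if (isLoggable(record)) {
        messages.add(record.getMessage());
      }
    }

    @Override
    public void flush() {}

    @Override
    public void close() {}
  }

  /**
   * Logs a "denied" message at the given level through a logger whose
   * threshold is INFO (an assumption of this sketch) and returns the
   * messages that actually reached the handler.
   */
  static List<String> logDenied(Level messageLevel, String url) {
    Logger log = Logger.getAnonymousLogger();
    log.setUseParentHandlers(false); // keep the console quiet
    log.setLevel(Level.INFO);        // assumed logger threshold
    CapturingHandler handler = new CapturingHandler();
    log.addHandler(handler);
    log.log(messageLevel, "Denied by robots.txt: " + url);
    return handler.messages;
  }

  public static void main(String[] args) {
    String url = "http://example.com/private/page.html";
    // Before the change: a debug-level message is dropped at INFO threshold.
    System.out.println("debug-like (FINE): " + logDenied(Level.FINE, url));
    // After the change: the same message is emitted at INFO level.
    System.out.println("info (INFO): " + logDenied(Level.INFO, url));
  }
}
```

Under the stated assumption, the first call returns an empty list and the second returns one message, which is why the denied URLs become visible without enabling debug logging.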


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fetcher to log skipped records by robots.txt
> --------------------------------------------
>
>                 Key: NUTCH-2630
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2630
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.15
>            Reporter: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.16
>
>
> To analyze problems, it would be helpful if the fetcher logged URLs that are 
> disallowed by robots.txt - see [discussion on user mailing 
> list|https://lists.apache.org/thread.html/7fe5b02104ea866aba183d009a5fad59ad4e4daf8954593ef0123dd6@%3Cuser.nutch.apache.org%3E].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
