Log list of files found but not crawled.
----------------------------------------

                 Key: NUTCH-649
                 URL: https://issues.apache.org/jira/browse/NUTCH-649
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
         Environment: any
            Reporter: Jim




        I use Nutch to find the locations of executables on the web, but we do 
not download the executables themselves.  To get Nutch to report the 
locations of files without downloading them, I had to make a very small 
patch to the code, and I think the change may be useful to others as well.  The 
patch logs URLs that are being filtered out at the INFO level, although 
perhaps DEBUG would be more appropriate.

   I have included an svn diff with this change.  Use cases include using it 
as a diagnostic tool (to see what we are skipping) as well as a way to find 
content and links pointed to by a page or site without having to actually 
download that content.



Index: ParseOutputFormat.java
===================================================================
--- ParseOutputFormat.java      (revision 593619)
+++ ParseOutputFormat.java      (working copy)
@@ -193,17 +193,20 @@
                toHost = null;
              }
              if (toHost == null || !toHost.equals(fromHost)) { // external links
+               LOG.info("filtering externalLink " + toUrl + " linked to by " + fromUrl);
+
               continue; // skip it
             }
           }
           try {
             toUrl = normalizers.normalize(toUrl,
                         URLNormalizers.SCOPE_OUTLINK); // normalize the url
-              toUrl = filters.filter(toUrl);   // filter the url
-              if (toUrl == null) {
-                continue;
-              }
-            } catch (Exception e) {
+
+              if (filters.filter(toUrl) == null) {   // filter the url
+                LOG.info("filtering content " + toUrl + " linked to by " + fromUrl);
+                continue;
+              }
+            } catch (Exception e) {
             continue;
           }
           CrawlDatum target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
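Applied, the patched loop behaves roughly like the self-contained sketch below. `FilterLogSketch`, its `filter` method, and the sample URLs are illustrative stand-ins rather than Nutch's actual `URLFilters`/`LOG` machinery; only the filter-then-log control flow mirrors the diff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

public class FilterLogSketch {
    private static final Logger LOG = Logger.getLogger(FilterLogSketch.class.getName());

    // Hypothetical filter: returns null for .exe links, the way a suffix
    // URL filter configured to skip executables would.
    static String filter(String url) {
        return url.endsWith(".exe") ? null : url;
    }

    static List<String> collectOutlinks(String fromUrl, List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        for (String toUrl : outlinks) {
            if (filter(toUrl) == null) {   // filter the url
                // The patch's addition: record what was skipped
                // instead of dropping it silently.
                LOG.info("filtering content " + toUrl + " linked to by " + fromUrl);
                continue;                  // skip it
            }
            kept.add(toUrl);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> kept = collectOutlinks("http://example.com/page.html",
                List.of("http://example.com/tool.exe", "http://example.com/docs.html"));
        System.out.println(kept);  // the .exe link is logged, not kept
    }
}
```

With this pattern the crawl log becomes the inventory of skipped URLs, so the locations of executables can be harvested without fetching them.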

