The best way to get this included is to submit a JIRA ticket and attach
your patch there. One or more of the committers, time allowing, will
then take a look at your patch for inclusion.
Dennis Kubes
misc wrote:
Hi all-
I asked about this before but got no answer, so I will try again.
I have included an svn diff with a small proposed change that would allow
users to track content that is found but filtered out during the crawl. This
is useful both as a diagnostic tool (to see what we are skipping) and as a
way to discover content and links pointed to by a page or site without having
to actually download that content.
I have set the log level to info, perhaps it should be debug.
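For anyone who finds the new message too chatty at INFO, a sketch of how it could be silenced per-class via log4j. This assumes the stock Nutch conf/log4j.properties layout and the ParseOutputFormat class name from the patch; adjust to your setup.

```properties
# Hypothetical tweak: raise the threshold for just this class so the new
# "filtering ..." messages are suppressed while everything else stays at INFO.
log4j.logger.org.apache.nutch.parse.ParseOutputFormat=WARN
```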
I think this would be a useful addition for many users.
Could someone make this change? If I am misunderstanding something and
there are already better ways to do this, what are they?
see you
-Jim
Index: ParseOutputFormat.java
===================================================================
--- ParseOutputFormat.java (revision 593619)
+++ ParseOutputFormat.java (working copy)
@@ -193,17 +193,20 @@
         toHost = null;
       }
       if (toHost == null || !toHost.equals(fromHost)) { // external links
+        LOG.info("filtering externalLink " + toUrl + " linked to by " + fromUrl);
+
         continue;                                       // skip it
       }
     }
     try {
       toUrl = normalizers.normalize(toUrl,
                 URLNormalizers.SCOPE_OUTLINK);          // normalize the url
-      toUrl = filters.filter(toUrl);                    // filter the url
-      if (toUrl == null) {
-        continue;
-      }
-    } catch (Exception e) {
+
+      if (filters.filter(toUrl) == null) {              // filter the url
+        LOG.info("filtering content " + toUrl + " linked to by " + fromUrl);
+        continue;
+      }
+    } catch (Exception e) {
       continue;
     }
     CrawlDatum target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
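To make the intent of the hunk above concrete, here is a minimal, self-contained sketch of the pattern it implements: run each outlink through a filter chain, and log and skip the link when the chain rejects it. The UrlFilter interface, the sample rules, and the class name are hypothetical stand-ins, not the real Nutch URLFilters API.

```java
import java.util.ArrayList;
import java.util.List;

public class OutlinkFilterSketch {

    /** Stand-in for a Nutch URLFilter: returns the url, or null to reject it. */
    interface UrlFilter {
        String filter(String url);
    }

    static final List<UrlFilter> FILTERS = new ArrayList<>();
    static {
        // Hypothetical rule: reject anything that is not plain http.
        FILTERS.add(url -> url.startsWith("http://") ? url : null);
        // Hypothetical rule: reject image links.
        FILTERS.add(url -> url.endsWith(".jpg") ? null : url);
    }

    /** Returns true if the outlink survives the chain; logs rejects as the patch does. */
    static boolean accept(String toUrl, String fromUrl) {
        String filtered = toUrl;
        for (UrlFilter f : FILTERS) {
            filtered = f.filter(filtered);
            if (filtered == null) {
                // Same idea as the patch's LOG.info(...) call.
                System.out.println("filtering content " + toUrl
                        + " linked to by " + fromUrl);
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        accept("http://example.com/page.html", "http://a.com/"); // kept, no output
        accept("http://example.com/img.jpg", "http://a.com/");   // logged and skipped
        accept("ftp://example.com/file", "http://a.com/");       // logged and skipped
    }
}
```

Note the one behavioral difference from the pre-patch code: the patched version tests `filters.filter(toUrl)` without assigning the result back to `toUrl`, so any filter that rewrites (rather than just accepts or rejects) a URL would no longer take effect.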