I've written a small patch for org.apache.nutch.crawl.Injector that
lets a plugin author force an injected URL to overwrite any existing
entry for that URL. I haven't submitted anything to JIRA before. Is
this worth submitting, and if so, how should I go about it?

Index: src/java/org/apache/nutch/crawl/Injector.java
===================================================================
--- src/java/org/apache/nutch/crawl/Injector.java (release version)
+++ src/java/org/apache/nutch/crawl/Injector.java (patched version)
@@ -41,8 +41,8 @@
 * crawled.  Useful for bootstrapping the system. */
public class Injector extends ToolBase {
  public static final Log LOG = LogFactory.getLog(Injector.class);
+  public static final Text OVERWRITE_INJECT = new Text("nutch.crawl.overrideInject");

-
  /** Normalize and filter injected urls. */
  public static class InjectMapper implements Mapper {
    private URLNormalizers urlNormalizers;
@@ -116,9 +116,17 @@
          old = val;
        }
      }
+
+      boolean isOverwrite = false;
+      if(injected!=null)
+       if(injected.getMetaData().containsKey(Injector.OVERWRITE_INJECT))
+         isOverwrite = ((BooleanWritable)injected.getMetaData().get(Injector.OVERWRITE_INJECT)).get();
+
      CrawlDatum res = null;
-      if (old != null) res = old; // don't overwrite existing value
-      else res = injected;
+      if (old != null && !isOverwrite)
+        res = old; // don't overwrite existing value
+      else
+        res = injected;

      output.collect(key, res);
    }
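
To make the intended behaviour easier to review, here is a rough
stand-alone sketch of the selection rule the patched reducer applies.
It deliberately avoids the Nutch and Hadoop types: the class
InjectOverwriteSketch, the Datum class and the select() method are
hypothetical stand-ins invented for this illustration, and only the
metadata key string is taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

public class InjectOverwriteSketch {
    // Same key string as in the patch; everything else is a stand-in.
    static final String OVERWRITE_INJECT = "nutch.crawl.overrideInject";

    /** Simplified stand-in for CrawlDatum: a label plus a metadata map. */
    static final class Datum {
        final String label;
        final Map<String, Boolean> metaData = new HashMap<>();

        Datum(String label) {
            this.label = label;
        }
    }

    /** Keeps the old entry unless the injected one carries the overwrite flag. */
    static Datum select(Datum old, Datum injected) {
        boolean isOverwrite = false;
        if (injected != null) {
            // A missing key or a false value both mean "do not overwrite".
            isOverwrite = Boolean.TRUE.equals(injected.metaData.get(OVERWRITE_INJECT));
        }
        if (old != null && !isOverwrite) {
            return old; // don't overwrite existing value
        }
        return injected;
    }

    public static void main(String[] args) {
        Datum old = new Datum("existing");
        Datum plain = new Datum("injected");
        Datum forced = new Datum("injected-forced");
        forced.metaData.put(OVERWRITE_INJECT, true);

        System.out.println(select(old, plain).label);   // prints "existing"
        System.out.println(select(old, forced).label);  // prints "injected-forced"
        System.out.println(select(null, plain).label);  // prints "injected"
    }
}
```

In the real patch the flag is a BooleanWritable stored in the injected
CrawlDatum's metadata under Injector.OVERWRITE_INJECT, so a plugin
would set it there before the reduce step runs.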

Cheers
Rob
