Andrzej Bialecki wrote:
Uroš Gruber wrote:
ParseData.metadata sounds nice, but I think I'm lost again :)
If I understand the code flow, the best place would be in Fetcher [262], but I'm not sure that datum holds info about the URL being fetched.

On the input to the fetcher you get a URL and a CrawlDatum (originally coming from the crawldb). Check, for example, how the segment name is passed around in metadata; you can use the same method.

Hi,

I made a draft patch, but I still see some problems. I know the code needs to be cleaned up and tested. Right now I don't know what hop number to set for external URLs; for internal links it works great.

The whole idea of these changes is as follows.

Injected URLs always get hop 0. While fetching/updating/generating, the hop value is incremented by 1 (I still have no idea what to do with external links). Then I can add a config value such as max_hop to keep the fetcher and generator from creating more URLs past that limit.

This way it's possible to limit crawling vertically (by depth).
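
As a rough illustration of the idea (not part of the patch), the hop bookkeeping could look something like the sketch below. The class HopSketch, its method names, and the maxHop parameter are all hypothetical; only the "injected = 0, outlink = parent + 1" rule comes from the description above, and external links are treated like any other link here because that question is still open.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the hop bookkeeping described above; not Nutch code. */
final class HopSketch {

    /** Hop assigned to injected (seed) URLs. */
    static final int INJECTED_HOP = 0;

    private HopSketch() {}

    /** An outlink found on a page is one hop further away than that page. */
    static int childHop(int parentHop) {
        return parentHop + 1;
    }

    /** True if a URL at this hop is still allowed under the assumed max_hop setting. */
    static boolean withinLimit(int hop, int maxHop) {
        return hop <= maxHop;
    }

    /** Keeps only the hop values within the limit, the way a generator might skip deep URLs. */
    static List<Integer> filter(List<Integer> hops, int maxHop) {
        List<Integer> kept = new ArrayList<>();
        for (int hop : hops) {
            if (withinLimit(hop, maxHop)) {
                kept.add(hop);
            }
        }
        return kept;
    }
}
```

Whether the limit should be inclusive (hop <= max_hop) or exclusive is a design choice; the sketch assumes inclusive.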

Comments are welcome.

regards,

Uros
Index: java/org/apache/nutch/crawl/CrawlDatum.java
===================================================================
--- java/org/apache/nutch/crawl/CrawlDatum.java (revision 437981)
+++ java/org/apache/nutch/crawl/CrawlDatum.java (working copy)
@@ -57,6 +57,7 @@
   private byte status;
   private long fetchTime = System.currentTimeMillis();
   private byte retries;
+  private int hop;
   private float fetchInterval;
   private float score = 1.0f;
   private byte[] signature = null;
@@ -82,6 +83,8 @@
   public byte getStatus() { return status; }
   public void setStatus(int status) { this.status = (byte)status; }
 
+  public int getHop() { return hop; }
+  public void setHop(int hop) { this.hop = hop; }
   public long getFetchTime() { return fetchTime; }
   public void setFetchTime(long fetchTime) { this.fetchTime = fetchTime; }
 
@@ -151,6 +154,7 @@
     retries = in.readByte();
     fetchInterval = in.readFloat();
     score = in.readFloat();
+    hop = in.readInt();
     if (version > 2) {
       modifiedTime = in.readLong();
       int cnt = in.readByte();
@@ -186,6 +190,7 @@
     out.writeByte(retries);
     out.writeFloat(fetchInterval);
     out.writeFloat(score);
+    out.writeInt(hop);
     out.writeLong(modifiedTime);
     if (signature == null) {
       out.writeByte(0);
@@ -210,6 +215,7 @@
     this.score = that.score;
     this.modifiedTime = that.modifiedTime;
     this.signature = that.signature;
+    this.hop = that.hop;
     this.metaData = new MapWritable(that.metaData); // make a deep copy
   }
 
@@ -290,6 +296,7 @@
     buf.append("Retries since fetch: " + getRetriesSinceFetch() + "\n");
     buf.append("Retry interval: " + getFetchInterval() + " days\n");
     buf.append("Score: " + getScore() + "\n");
+    buf.append("Hop: " + getHop() + "\n");
     buf.append("Signature: " + StringUtil.toHexString(getSignature()) + "\n");
     buf.append("Metadata: " + (metaData != null ? metaData.toString() : "null") + "\n");
     return buf.toString();
Index: java/org/apache/nutch/crawl/Injector.java
===================================================================
--- java/org/apache/nutch/crawl/Injector.java   (revision 437981)
+++ java/org/apache/nutch/crawl/Injector.java   (working copy)
@@ -77,6 +77,7 @@
         value.set(url);                           // collect it
         CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_DB_UNFETCHED, interval);
         datum.setScore(scoreInjected);
+        datum.setHop(0);
         try {
           scfilters.initialScore(value, datum);
         } catch (ScoringFilterException e) {
Index: java/org/apache/nutch/fetcher/Fetcher.java
===================================================================
--- java/org/apache/nutch/fetcher/Fetcher.java  (revision 437981)
+++ java/org/apache/nutch/fetcher/Fetcher.java  (working copy)
@@ -260,6 +260,8 @@
       Metadata metadata = content.getMetadata();
       // add segment to metadata
       metadata.set(SEGMENT_NAME_KEY, segmentName);
+
+      metadata.set("hop", Integer.toString(datum.getHop()));
       // add score to content metadata so that ParseSegment can pick it up.
       try {
         scfilters.passScoreBeforeParsing(key, datum, content);
Index: java/org/apache/nutch/parse/ParseOutputFormat.java
===================================================================
--- java/org/apache/nutch/parse/ParseOutputFormat.java  (revision 437981)
+++ java/org/apache/nutch/parse/ParseOutputFormat.java  (working copy)
@@ -85,8 +85,8 @@
           String fromHost = null; 
           String toHost = null;          
           textOut.append(key, new ParseText(parse.getText()));
-          
           ParseData parseData = parse.getData();
+          String pd = parseData.getContentMeta().get("hop");
           // recover the signature prepared by Fetcher or ParseSegment
           String sig = parseData.getContentMeta().get(Fetcher.SIGNATURE_KEY);
           if (sig != null) {
@@ -151,6 +151,7 @@
               }
               continue;
             }
+            target.setHop(Integer.parseInt(pd)+1);
             crawlOut.append(targetUrl, target);
             if (adjust != null) crawlOut.append(key, adjust);
           }
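
One case the ParseOutputFormat hunk above does not cover: if the "hop" key is missing from the content metadata, pd is null and Integer.parseInt(pd) throws NumberFormatException. A minimal sketch of a more defensive parse follows; the class HopMetadata and the method parseHop are hypothetical names, and falling back to a default value is an assumption, not something the patch does.

```java
/** Hypothetical helper for reading the "hop" metadata value defensively; not Nutch code. */
final class HopMetadata {

    private HopMetadata() {}

    /**
     * Parses a hop value stored as a string in metadata.
     * Returns the fallback when the value is missing or is not a valid integer.
     */
    static int parseHop(String value, int fallback) {
        if (value == null) {
            return fallback;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }
}
```

With something like this, the outlink line could read target.setHop(HopMetadata.parseHop(pd, 0) + 1), assuming 0 is an acceptable default for pages that carry no hop value.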
