Andrzej Bialecki wrote:
Uroš Gruber wrote:
ParseData.metadata sounds nice, but I think I'm lost again :)
If I understand the code flow, the best place would be in Fetcher [262],
but I'm not sure that datum holds info about the URL being fetched.
On the input to the fetcher you get a URL and a CrawlDatum (originally
coming from the crawldb). Check for example how the segment name is
passed around in metadata, you can use the same method.
Hi,
I made a draft patch, but I still see some problems. I know the
code needs to be cleaned up and tested. Right now I don't know what
hop number to set for external URLs. For internal links it works great.
The whole idea of these changes:
Injected URLs always get hop 0. During fetching/updating/generating the
hop value is incremented by 1 (I still have no idea what to do with
external links). Then I can add a config value such as max_hop to stop
the fetcher and generator from creating more URLs.
This way it's possible to limit crawling vertically (by depth).
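
To make the idea concrete, here is a minimal, self-contained sketch of the hop bookkeeping, independent of Nutch. The class HopLimitSketch, its methods, and the maxHop parameter are hypothetical names for illustration (max_hop is only a proposed config value), and a negative maxHop is assumed to mean "no limit". External links are treated the same as internal ones here, since that question is still open.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Nutch.
final class HopLimitSketch {

    // Injected (seed) URLs start at hop 0, as in the patch.
    static final int INJECTED_HOP = 0;

    // An outlink is one hop further from the seed than the page it was found on.
    static int childHop(int parentHop) {
        return parentHop + 1;
    }

    // Assumption: a negative maxHop disables the limit.
    static boolean withinLimit(int hop, int maxHop) {
        return maxHop < 0 || hop <= maxHop;
    }

    // Keep the outlinks of a page only if their hop value is still allowed.
    static List<String> selectOutlinks(int parentHop, List<String> outlinks, int maxHop) {
        List<String> kept = new ArrayList<>();
        if (withinLimit(childHop(parentHop), maxHop)) {
            kept.addAll(outlinks);
        }
        return kept;
    }
}
```

With maxHop = 2, outlinks of a page at hop 1 are kept (they get hop 2), while outlinks of a page at hop 2 are dropped (they would get hop 3).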
Comments are welcome.
regards,
Uros
Index: java/org/apache/nutch/crawl/CrawlDatum.java
===================================================================
--- java/org/apache/nutch/crawl/CrawlDatum.java (revision 437981)
+++ java/org/apache/nutch/crawl/CrawlDatum.java (working copy)
@@ -57,6 +57,7 @@
private byte status;
private long fetchTime = System.currentTimeMillis();
private byte retries;
+ private int hop;
private float fetchInterval;
private float score = 1.0f;
private byte[] signature = null;
@@ -82,6 +83,8 @@
public byte getStatus() { return status; }
public void setStatus(int status) { this.status = (byte)status; }
+ public int getHop() { return hop; }
+ public void setHop(int hop) { this.hop = hop; }
public long getFetchTime() { return fetchTime; }
public void setFetchTime(long fetchTime) { this.fetchTime = fetchTime; }
@@ -151,6 +154,7 @@
retries = in.readByte();
fetchInterval = in.readFloat();
score = in.readFloat();
+ hop = in.readInt();
if (version > 2) {
modifiedTime = in.readLong();
int cnt = in.readByte();
@@ -186,6 +190,7 @@
out.writeByte(retries);
out.writeFloat(fetchInterval);
out.writeFloat(score);
+ out.writeInt(hop);
out.writeLong(modifiedTime);
if (signature == null) {
out.writeByte(0);
@@ -210,6 +215,7 @@
this.score = that.score;
this.modifiedTime = that.modifiedTime;
this.signature = that.signature;
+ this.hop = that.hop;
this.metaData = new MapWritable(that.metaData); // make a deep copy
}
@@ -290,6 +296,7 @@
buf.append("Retries since fetch: " + getRetriesSinceFetch() + "\n");
buf.append("Retry interval: " + getFetchInterval() + " days\n");
buf.append("Score: " + getScore() + "\n");
+ buf.append("Hop: " + getHop() + "\n");
buf.append("Signature: " + StringUtil.toHexString(getSignature()) + "\n");
buf.append("Metadata: " + (metaData != null ? metaData.toString() : "null") + "\n");
return buf.toString();
Index: java/org/apache/nutch/crawl/Injector.java
===================================================================
--- java/org/apache/nutch/crawl/Injector.java (revision 437981)
+++ java/org/apache/nutch/crawl/Injector.java (working copy)
@@ -77,6 +77,7 @@
value.set(url); // collect it
CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_DB_UNFETCHED, interval);
datum.setScore(scoreInjected);
+ datum.setHop(0);
try {
scfilters.initialScore(value, datum);
} catch (ScoringFilterException e) {
Index: java/org/apache/nutch/fetcher/Fetcher.java
===================================================================
--- java/org/apache/nutch/fetcher/Fetcher.java (revision 437981)
+++ java/org/apache/nutch/fetcher/Fetcher.java (working copy)
@@ -260,6 +260,8 @@
Metadata metadata = content.getMetadata();
// add segment to metadata
metadata.set(SEGMENT_NAME_KEY, segmentName);
+
+ metadata.set("hop", Integer.toString(datum.getHop()));
// add score to content metadata so that ParseSegment can pick it up.
try {
scfilters.passScoreBeforeParsing(key, datum, content);
Index: java/org/apache/nutch/parse/ParseOutputFormat.java
===================================================================
--- java/org/apache/nutch/parse/ParseOutputFormat.java (revision 437981)
+++ java/org/apache/nutch/parse/ParseOutputFormat.java (working copy)
@@ -85,8 +85,8 @@
String fromHost = null;
String toHost = null;
textOut.append(key, new ParseText(parse.getText()));
-
ParseData parseData = parse.getData();
+ String pd = parseData.getContentMeta().get("hop");
// recover the signature prepared by Fetcher or ParseSegment
String sig = parseData.getContentMeta().get(Fetcher.SIGNATURE_KEY);
if (sig != null) {
@@ -151,6 +151,7 @@
}
continue;
}
+ target.setHop(Integer.parseInt(pd)+1);
crawlOut.append(targetUrl, target);
if (adjust != null) crawlOut.append(key, adjust);
}
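
One thing to watch in the ParseOutputFormat change: Integer.parseInt throws NumberFormatException when its argument is null or not a number, which would happen if the "hop" entry is missing from the content metadata (for example, for content fetched before this patch). A defensive helper could look like the sketch below. HopMetadataSketch and parseHop are hypothetical names, not part of Nutch, and falling back to a default hop is an assumption about the desired behavior.

```java
// Hypothetical helper; not part of Nutch.
final class HopMetadataSketch {

    // Parse a hop value read from metadata, returning fallback
    // when the entry is missing or is not a valid integer.
    static int parseHop(String raw, int fallback) {
        if (raw == null) {
            return fallback;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }
}
```

The target hop would then be computed as parseHop(pd, 0) + 1 instead of Integer.parseInt(pd) + 1, assuming 0 is an acceptable default for pages without a recorded hop.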