Andrzej,
Thanks for your response and the patch. I have a few more questions about
adaptive refetch, though. As far as I understand, the solution below is
to avoid overwriting some fields of the existing entries in the db.
Suppose we applied the adaptive refetch idea from your patch to version
0.7. The same redirection problem exists there too.
What do you think is the best way to solve it in version 0.7?
Thanks...
Mehmet
Andrzej Bialecki wrote:
Andrzej Bialecki wrote:
Mehmet Tan wrote:
Hi,
I want to ask a question about redirections. Correct me if I'm wrong,
but if a page is redirected to a page that is already in the webdb,
the next updatedb operation will overwrite all of its previous refetch
info, because the fetcher creates it as a new page whose fetchInterval
is the initial fetch interval. How does the adaptive refetch algorithm
handle this situation?
Yes, this is a bug, and it affects both the original and the patched
versions. The fetch interval shouldn't be blindly copied from any new
CrawlDatum (this happens in CrawlDbReducer.java:86 in both versions);
instead it should be initialized with the value from
old.getFetchInterval(), if available. Please fix this in your
version, and I'll fix it in the un-patched version.
Thanks for spotting this!
Please check the attached patch; it should properly copy all original
values first, and then update only those that need to change.
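
A minimal sketch of the copy-then-update idea, for illustration only. The
Datum class, its fields, the string status values and the merge() helper
are hypothetical stand-ins, not the real CrawlDatum or CrawlDbReducer
code; only the order of operations mirrors the patch below.

```java
import java.util.HashMap;
import java.util.Map;

public class RefetchMergeSketch {

    /** Hypothetical, simplified stand-in for CrawlDatum (not the Nutch class). */
    static final class Datum {
        String status;
        float fetchInterval;
        final Map<String, String> metaData = new HashMap<>();

        Datum(String status, float fetchInterval) {
            this.status = status;
            this.fetchInterval = fetchInterval;
        }

        /** Copy every field from another datum. */
        void set(Datum other) {
            status = other.status;
            fetchInterval = other.fetchInterval;
            metaData.clear();
            metaData.putAll(other.metaData);
        }
    }

    /** Start from the old db entry when one exists, then update only what must change. */
    static Datum merge(Datum old, Datum highest) {
        Datum result = new Datum(null, 0f);
        if (old != null) {
            result.set(old);                          // keeps the tuned fetchInterval
            result.metaData.putAll(highest.metaData); // overlay the new metadata
        } else {
            result.set(highest);                      // no history, take the new entry
        }
        if ("FETCH_SUCCESS".equals(highest.status)) {
            result.status = "DB_FETCHED";             // only the status is updated here
        }
        return result;
    }

    public static void main(String[] args) {
        Datum old = new Datum("DB_FETCHED", 7.5f);     // interval already adapted
        Datum fresh = new Datum("FETCH_SUCCESS", 30f); // redirect target, default interval
        Datum merged = merge(old, fresh);
        System.out.println(merged.status + " " + merged.fetchInterval);
    }
}
```

With an existing entry, the merged datum keeps the old interval (7.5 in
this sketch) instead of taking the default carried by the freshly
fetched entry.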
------------------------------------------------------------------------
Index: CrawlDbReducer.java
===================================================================
--- CrawlDbReducer.java (revision 389791)
+++ CrawlDbReducer.java (working copy)
@@ -61,38 +61,38 @@
}
}
- CrawlDatum result = null;
+ CrawlDatum result = new CrawlDatum();
+ // initialize with previous values, also copy metadata from old
+ // and overlay them with new metadata
+ if (old != null) {
+ result.set(old);
+ result.getMetaData().putAll(highest.getMetaData());
+ } else {
+ result.set(highest);
+ }
switch (highest.getStatus()) { // determine new status
case CrawlDatum.STATUS_DB_UNFETCHED: // no new entry
case CrawlDatum.STATUS_DB_FETCHED:
case CrawlDatum.STATUS_DB_GONE:
- result = old; // use old
+ // use old
+ result = old;
break;
case CrawlDatum.STATUS_LINKED: // highest was link
- if (old != null) { // if old exists
- result = old; // use it
- } else {
- result = highest; // use new entry
+ if (old == null) {
result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
- result.setScore(1.0f); // initial score is 1.0f
}
- result.setSignature(null); // reset the signature
break;
case CrawlDatum.STATUS_FETCH_SUCCESS: // succesful fetch
- result = highest; // use new entry
- if (highest.getSignature() == null) highest.setSignature(signature);
+ if (highest.getSignature() == null) result.setSignature(signature);
result.setStatus(CrawlDatum.STATUS_DB_FETCHED);
result.setNextFetchTime();
break;
case CrawlDatum.STATUS_FETCH_RETRY: // temporary failure
- result = highest; // use new entry
- if (old != null)
- result.setSignature(old.getSignature()); // use old signature
if (highest.getRetriesSinceFetch() < retryMax) {
result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
} else {
@@ -101,9 +101,6 @@
break;
case CrawlDatum.STATUS_FETCH_GONE: // permanent failure
- result = highest; // use new entry
- if (old != null)
- result.setSignature(old.getSignature()); // use old signature
result.setStatus(CrawlDatum.STATUS_DB_GONE);
break;
@@ -111,10 +108,8 @@
throw new RuntimeException("Unknown status: "+highest.getStatus());
}
- if (result != null) {
- result.setScore(result.getScore() + scoreIncrement);
- output.collect(key, result);
- }
+ result.setScore(result.getScore() + scoreIncrement);
+ output.collect(key, result);
}
}
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general