Aha, clear - thank you for the explanation.

How about just some javadoc like this?

$ svn diff src/java/org/apache/nutch/crawl/CrawlDatum.java
Index: src/java/org/apache/nutch/crawl/CrawlDatum.java
===================================================================
--- src/java/org/apache/nutch/crawl/CrawlDatum.java     (revision 646194)
+++ src/java/org/apache/nutch/crawl/CrawlDatum.java     (working copy)
@@ -158,7 +158,15 @@
   
   public void setStatus(int status) { this.status = (byte)status; }
 
+  /**
+   * Returns either the time of the last fetch, or the next fetch time,
+   * depending on whether Fetcher or CrawlDbReducer set the time.
+   */
   public long getFetchTime() { return fetchTime; }
+  /**
+   * Sets either the time of the last fetch or the next fetch time,
+   * depending on whether Fetcher or CrawlDbReducer set the time.
+   */
   public void setFetchTime(long fetchTime) { this.fetchTime = fetchTime; }
 
   public long getModifiedTime() {


Otis 

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, April 10, 2008 4:39:01 AM
Subject: Re: CrawlDatum: mislabeling?

[EMAIL PROTECTED] wrote:
> Hi,
> 
> Does "Fetch time" in CrawlDatum really represent "Next fetch time"?
> 
> Example:
> The URL below was just fetched.  After that bin/nutch readdb was run:
> 
> $ bin/nutch readdb /user/foo/crawl/crawldb -url http://www.foobar.com/
> 
> URL: http://www.foobar.com/
> Version: 6
> Status: 6 (db_notmodified)
> Fetch time: Fri May 09 17:17:31 EDT 2008          <---- NOTE: 30 days from now??
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 3.955374E-8
> Signature: f3ee31dcfde9ca40f4ed4a4e1bf66e24
> Metadata: _pst_:temp_moved(13), lastModified=0: http://foobar.com/
>  
> 
> Either the above "Fetch time" is off by 1 month, or the above "Fetch time" 
> should really be labeled "Next fetch time".
> Looking at CrawlDatum, it looks like it's the latter.  Is that so?

Well, this field serves two purposes, so the name is ambiguous on 
purpose (and that's probably bad ;) ). The CrawlDatum class is used in 
many contexts: it keeps the (static) status of pages in CrawlDb, but 
it's also used during fetching / updating jobs to keep track of the 
current (changing) status of pages as they are being fetched. E.g. 
fetchers will update this field to contain the actual fetch time (so it 
no longer carries the meaning "next fetch time" in that case - instead 
its value is equal to the actual fetch time when the page was fetched). 
On the other hand, the CrawlDbReducer modifies this value to set the 
time of the next fetch, and as such it's recorded in the CrawlDb ...
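
To make the two meanings concrete, here is a minimal sketch in Java. It is 
not the actual Nutch code: FetchTimeSketch, Datum, recordFetch and 
scheduleNextFetch are hypothetical names, and computing the next fetch as 
"last fetch + retry interval" is a simplifying assumption made only for 
illustration.

```java
import java.util.Date;

// Hypothetical, simplified stand-in for CrawlDatum; NOT the real Nutch class.
class FetchTimeSketch {

    static final class Datum {
        private long fetchTime; // milliseconds since the epoch

        long getFetchTime() { return fetchTime; }
        void setFetchTime(long fetchTime) { this.fetchTime = fetchTime; }
    }

    // Fetcher-side use: the field holds the time the page was actually fetched.
    static void recordFetch(Datum datum, long nowMillis) {
        datum.setFetchTime(nowMillis);
    }

    // Update-side use: the same field is overwritten with the next due time.
    // Assumption for this sketch: next fetch = last fetch + retry interval.
    static void scheduleNextFetch(Datum datum, long retryIntervalSeconds) {
        datum.setFetchTime(datum.getFetchTime() + retryIntervalSeconds * 1000L);
    }

    public static void main(String[] args) {
        Datum datum = new Datum();

        recordFetch(datum, System.currentTimeMillis());
        System.out.println("After fetching: " + new Date(datum.getFetchTime()));

        // 2592000 seconds is the 30-day retry interval from the readdb output.
        scheduleNextFetch(datum, 2592000L);
        System.out.println("After update:   " + new Date(datum.getFetchTime()));
    }
}
```

Under those assumptions, the second line printed is 30 days after the 
first, which matches the "Fetch time" that readdb showed for a page that 
had just been fetched.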

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



