While the fetch is in progress (or any other operation, for that matter), all the working information is kept in the Hadoop temp directory. This defaults to "/tmp/hadoop-<username>" unless you specify something else using the "hadoop.tmp.dir" property in your hadoop-site.xml file. When the fetch is complete and Hadoop finishes its parse/reduce stage, you will notice that all the information has been copied to the applicable segment directory.
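For reference, overriding that property is a small hadoop-site.xml fragment; the path below is just an example, point it at whatever disk has enough free space for your crawl:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <!-- example path; working data for fetch/parse jobs lives here -->
    <value>/data/hadoop-tmp</value>
  </property>
</configuration>
```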
----- Original Message -----
From: chee wu <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, January 3, 2007 11:23:56 AM
Subject: Re: nutch81 pages seems were not kept but no error message found

Thanks Sean, I see. The fetcher process only updates the status in the segments, but the status reported by readdb comes from the crawldb.

Another question in this thread: why does the size of the crawl dir, which includes the crawldb and segments, always remain unchanged? The pages already fetched should be kept in the segments, so the size of the segments directory should also increase accordingly, is this true?

----- Original Message -----
From: "Sean Dean" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Wednesday, January 03, 2007 11:05 PM
Subject: Re: nutch81 pages seems were not kept but no error message found

The Nutch DB stats (and everything else in there) will not get updated until you actually issue an "updatedb" command on a fetched segment. Nutch does not support real-time updates of this information.

----- Original Message -----
From: Chee Wu <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, January 3, 2007 7:33:08 AM
Subject: nutch81 pages seems were not kept but no error message found

Hi all,

I am using the crawl tool in Nutch81 under Cygwin, trying to retrieve pages from about 2 thousand websites, and the crawl process has been running for nearly 20 hours. But during the past 10 hours, the fetch status has always remained the same, as below:

TOTAL urls: 165212
retry 0:    164110
retry 1:    814
retry 2:    288
min score:  0.0
avg score:  0.029228665
max score:  2.333
status 1 (DB_unfetched): 134960
status 2 (DB_fetched):   27812
status 3 (DB_gone):      2440

All the numbers in the status remain the same; the DB_fetched count is always 27812. From the console output and hadoop.log I can see that the page-fetching process is running without any error. The size of the crawl db also shows no change, always 328M. I have tried to solve this problem all through the last week.
Any hints on this problem are appreciated. Thanks and bow~~~
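Sean's point about "updatedb" can be sketched as the usual command sequence (the segment timestamp and crawl directory here are examples; substitute your own paths):

```shell
# fetch a segment, then fold its results back into the crawldb;
# readdb -stats only reflects fetched pages after the updatedb step
bin/nutch fetch crawl/segments/20070103123456
bin/nutch updatedb crawl/crawldb crawl/segments/20070103123456
bin/nutch readdb crawl/crawldb -stats
```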
