While the fetch is in progress (or any other operation, for that matter), all the working information is kept in the Hadoop temp directory. This defaults to "/tmp/hadoop-<username>" unless you specify something else using the "hadoop.tmp.dir" property in your hadoop-site.xml file. When the fetch is complete and Hadoop finishes its parse/reduce stage, you will notice that all the information has been copied to the applicable segment directory.
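For reference, overriding that property is a small hadoop-site.xml fragment; the path below is just an example, point it at whatever disk has enough free space for your crawl:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <!-- example path; working data for fetch/parse jobs lives here -->
    <value>/data/hadoop-tmp</value>
  </property>
</configuration>
```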
----- Original Message -----
From: chee wu <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, January 3, 2007 11:23:56 AM
Subject: Re: nutch81 pages seems were not kept but no error message found

Thanks Sean, I see. The fetcher process only updates the status in the segments, but the status reported by readdb comes from the crawldb.

Another question in this thread: why does the size of the crawl dir, which includes the crawldb and segments, always remain unchanged? The pages already fetched should be kept in the segments, so the size of the segments directory should also increase accordingly, is this true?

----- Original Message -----
From: "Sean Dean" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Wednesday, January 03, 2007 11:05 PM
Subject: Re: nutch81 pages seems were not kept but no error message found

The Nutch DB stats (and everything else in there) will not get updated until you actually issue an "updatedb" command on a fetched segment. Nutch does not support real-time updates of this information.

----- Original Message -----
From: Chee Wu <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, January 3, 2007 7:33:08 AM
Subject: nutch81 pages seems were not kept but no error message found

Hi all,

I am using the crawl tool in Nutch81 under Cygwin, trying to retrieve pages from about 2 thousand websites, and the crawl process has been running for nearly 20 hours. But during the past 10 hours, the fetch status has always remained the same, as below:

TOTAL urls: 165212
retry 0:    164110
retry 1:    814
retry 2:    288
min score:  0.0
avg score:  0.029228665
max score:  2.333
status 1 (DB_unfetched): 134960
status 2 (DB_fetched):   27812
status 3 (DB_gone):      2440

All the numbers in the status remain the same; the DB_fetched count is always 27812. From the console output and hadoop.log I can see that the page-fetching process is running without any error. The size of the crawl db also shows no change, always 328M. I have tried to solve this problem all through the last week.
Any hints on this problem are appreciated. Thanks and bow~~~
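Sean's point about "updatedb" can be sketched as the usual command sequence (the segment timestamp and crawl directory here are examples; substitute your own paths):

```shell
# fetch a segment, then fold its results back into the crawldb;
# readdb -stats only reflects fetched pages after the updatedb step
bin/nutch fetch crawl/segments/20070103123456
bin/nutch updatedb crawl/crawldb crawl/segments/20070103123456
bin/nutch readdb crawl/crawldb -stats
```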
