Thanks Sean, I see. The fetcher process only updates the status in the segments, but the status reported by readdb comes from the crawldb. Another question from this mail thread: why does the size of the crawl dir, which includes the crawldb and the segments, always remain unchanged? The pages already fetched should be kept in the segments, so the size of the segments directory should increase accordingly. Is this true?
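
To check whether the segments directory really stays the same size while the fetcher runs, something like the following could be run a few times during a fetch. This is only a minimal sketch: it assumes the crawl dir is on the local filesystem and has subdirectories such as crawldb and segments, and the helper names (dir_size_bytes, crawl_sizes) are made up for illustration and are not part of Nutch.

```python
import os


def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so linked files are not counted twice.
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def crawl_sizes(crawl_dir):
    """Map each immediate subdirectory of crawl_dir to its size in bytes.

    Assumption: crawl_dir looks like the output of the crawl tool,
    i.e. it contains directories such as 'crawldb' and 'segments'.
    """
    sizes = {}
    for entry in sorted(os.listdir(crawl_dir)):
        full = os.path.join(crawl_dir, entry)
        if os.path.isdir(full):
            sizes[entry] = dir_size_bytes(full)
    return sizes
```

Calling crawl_sizes("crawl") every few minutes and comparing the numbers would show whether only the crawldb is static or the segments are not growing either.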
----- Original Message -----
From: "Sean Dean" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Wednesday, January 03, 2007 11:05 PM
Subject: Re: nutch81 pages seems were not kept but no error message found

The Nutch DB stats (and everything else in there) will not get updated until you actually issue an "updatedb" command on a fetched segment. Nutch does not support real-time updates of this information.

----- Original Message ----
From: Chee Wu <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, January 3, 2007 7:33:08 AM
Subject: nutch81 pages seems were not kept but no error message found

Hi all,

I am using the crawl tool in Nutch81 under Cygwin, trying to retrieve pages from about 2 thousand websites, and the crawl process has been running for nearly 20 hours. But during the past 10 hours, the fetch status has remained the same, as below:

TOTAL urls: 165212
retry 0: 164110
retry 1: 814
retry 2: 288
min score: 0.0
avg score: 0.029228665
max score: 2.333
status 1 (DB_unfetched): 134960
status 2 (DB_fetched): 27812
status 3 (DB_gone): 2440

All the numbers in the status remain the same; the DB_fetched count is always 27812. From the console output and hadoop.log I can see that the page fetching process is running without any error. The size of the crawl db has not changed either; it is always 328M.

I have been trying to solve this problem for the whole of last week. Any hints are appreciated.

Thanks and bow~~~
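
To illustrate Sean's point that the readdb stats only change once "updatedb" has been run on a fetched segment, here is a minimal sketch of that two-step sequence driven from Python. It assumes the usual bin/nutch launcher script and the argument order "updatedb <crawldb> <segment>" and "readdb <crawldb> -stats"; the function name, the paths and the segment name in the usage note are hypothetical.

```python
import subprocess


def updatedb_then_stats(crawldb, segment, nutch="bin/nutch", run=subprocess.run):
    """Fold one fetched segment into the crawldb, then print the crawldb stats.

    Assumptions: 'nutch' points at the bin/nutch launcher, and 'segment' is a
    segment directory whose fetch has already finished.
    The 'run' parameter exists only so the calls can be replaced in a dry run.
    """
    commands = [
        # Merge the fetch results of this segment into the crawldb.
        [nutch, "updatedb", crawldb, segment],
        # Only now should the DB_fetched / DB_unfetched counts move.
        [nutch, "readdb", crawldb, "-stats"],
    ]
    for cmd in commands:
        run(cmd, check=True)
    return commands
```

For example, updatedb_then_stats("crawl/crawldb", "crawl/segments/20070103") would run both steps for a (hypothetical) segment named 20070103. Running readdb alone while a fetch is still in progress would be expected to keep printing the old numbers.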
