The DB_unfetched value will grow after every successful crawl. It consists
of any leftover URLs (that were not fetched) plus the new URLs found on the
pages you crawled previously.
This is not an error, and the crawler is not missing URLs (it's actually
finding you new ones). If you don't want it to work this way, you can always
change the setting "db.max.outlinks.per.page" to 0 instead of 100 (default).
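
To make the growth easier to picture, here is a rough toy model in Python. It is not Nutch code: the crawl() function, the link_graph data and the max_outlinks parameter are all made up for illustration, and max_outlinks only loosely mirrors what "db.max.outlinks.per.page" does. Each round fetches everything that is currently unfetched and records the links found on those pages as new unfetched URLs.

```python
def crawl(link_graph, seeds, depth, max_outlinks=100):
    """Toy crawl: returns (fetched, unfetched) URL sets after `depth` rounds."""
    fetched = set()
    unfetched = set(seeds)
    for _ in range(depth):
        batch = sorted(unfetched)  # snapshot of what this round will fetch
        if not batch:
            break
        for url in batch:
            unfetched.discard(url)
            fetched.add(url)
            # Record at most max_outlinks links per page (0 records none).
            for link in link_graph.get(url, [])[:max_outlinks]:
                if link not in fetched:
                    unfetched.add(link)
    return fetched, unfetched


# Hypothetical site: one seed page linking to three pages, which link further.
link_graph = {
    "seed": ["a", "b", "c"],
    "a": ["d", "e"],
    "b": ["f"],
    "c": ["g", "h"],
}

for depth in (1, 2):
    fetched, unfetched = crawl(link_graph, ["seed"], depth)
    print(depth, len(fetched), len(unfetched))
# prints "1 1 3" then "2 4 5": unfetched grows as new links are discovered

fetched, unfetched = crawl(link_graph, ["seed"], 2, max_outlinks=0)
print(len(fetched), len(unfetched))
# prints "1 0": with no outlinks recorded, nothing new is queued
```

In this sketch the unfetched set at the end of a run is simply the links discovered on the last round, which is why a depth-limited crawl normally finishes with a large unfetched count.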
The "fetcher.threads.fetch" setting has nothing to do with "what" is crawled;
it has more to do with the overall speed of the crawl. Also, when you do start
that big crawl of yours, make sure it's not set to 1, as this will perform poorly.
----- Original Message ----
From: cesar voulgaris <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, January 16, 2007 11:57:21 PM
Subject: DB_unfetched status
Hi, I'm using the 0.8.1 version. I have a problem with the
fetcher.threads.fetch property.
Setting the value of this property to 100, then crawling (depth 3) and a
readdb, I get:
CrawlDb statistics start: crawltest/crawldb/
Statistics for CrawlDb: crawltest/crawldb/
TOTAL urls: 82
retry 0: 82
min score: 0.0
avg score: 0.026024392
max score: 1.018
status 1 (DB_unfetched ): 68
status 2 (DB_fetched): 10
status 3 (DB_gone): 4
CrawlDb statistics: done
Lowering the value, I get more TOTAL urls and a bigger
DB_fetched/DB_unfetched ratio (always exactly the same crawl).
Does this mean that the crawler is missing urls? What does DB_unfetched
mean, anyway? If I try with fetcher.threads.fetch = 1, I still
get lots of DB_unfetched results.
The hadoop.log doesn't show any error or distinctive message (in nutch
0.7.2 I got https exceeded max delay messages with high values).
I appreciate any comments, I don't want to start a big crawl without knowing
the performance of the crawler... thanks
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general