Thanks for your answers, Doug, it makes more sense now. I'm still
puzzled about why the DB_fetched count changes so much with different
values for the map/reduce task settings. I'm going to inspect the logs
and see if I can track down what's going on. Also, I tried using
protocol-http rather than protocol-httpclient, but it didn't make any
difference.
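
In case it's useful, this is roughly how I'm counting fetch attempts
and errors (a rough sketch; the log locations are from my setup, so
the paths and patterns may need adjusting):

  # on each slave: total "fetching" lines (one per attempted url)
  grep -h fetching logs/*tasktracker*.log | wc -l

  # likewise for the errors you mentioned
  grep -h 'max delays exceeded' logs/*tasktracker*.log | wc -l
  grep -hi timeout logs/*tasktracker*.log | wc -l

For reference, the protocol swap itself was just a matter of listing
protocol-http instead of protocol-httpclient in the plugin.includes
property in conf/nutch-site.xml.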
Thanks,
Florent

Doug Cutting wrote:
> Florent Gluck wrote:
>
>> When I inject 25000 urls and fetch them (depth = 1) and do a readdb
>> -stats, I get:
>>
>> 060110 171347 Statistics for CrawlDb: crawldb
>> 060110 171347 TOTAL urls: 27939
>> 060110 171347 avg score: 1.011
>> 060110 171347 max score: 8.883
>> 060110 171347 min score: 1.0
>> 060110 171347 retry 0: 26429
>> 060110 171347 retry 1: 1510
>> 060110 171347 status 1 (DB_unfetched): 24248
>> 060110 171347 status 2 (DB_fetched): 3390
>> 060110 171347 status 3 (DB_gone): 301
>> 060110 171347 CrawlDb statistics: done
>>
>> There are several things that don't make sense to me and it would be
>> great if someone could clear this up:
>>
>> 1.
>> If I count the occurrences of "fetching" in all of my slaves'
>> tasktracker logs, I get 6225.
>> This number clearly doesn't match the DB_fetched of 3390 from the
>> readdb output. Why is that?
>> What happened to the 6225-3390=2835 missing urls?
>
> How many errors are you seeing while fetching? Are you getting, e.g.,
> lots of timeouts or "max delays exceeded"?
>
> You might also try using protocol-http rather than
> protocol-httpclient. Others have reported under-fetching issues with
> protocol-httpclient.
>
>> 2.
>> Why is the TOTAL urls 27939 if I inject a file with 25000 entries?
>> Why is it not 25000?
>
> When the crawl db is updated, it adds pages linked to by fetched
> pages, with status DB_unfetched.
>
>> 3.
>> What is the meaning of DB_gone and DB_unfetched?
>> I was assuming that if you injected a total of 25k urls of which 5000
>> were fetchable, you would get something like:
>> (DB_unfetched): 20000
>> (DB_fetched): 5000
>> That's not the case, so I'd like to understand what exactly is going
>> on here.
>> Also, what is the meaning of DB_gone?
>
> DB_gone means that a 404 or some other presumably permanent error was
> encountered. This status prevents future attempts to fetch a url.
>
>> 4.
>> If I redo the exact same inject + crawl with the same 25000 urls
>> (starting from an empty crawldb, of course), but use the following
>> mapred settings instead: mapred.map.tasks=200 and
>> mapred.reduce.tasks=8, I get the following readdb output:
>>
>> 060110 162140 TOTAL urls: 33173
>> 060110 162140 avg score: 1.026
>> 060110 162140 max score: 22.083
>> 060110 162140 min score: 1.0
>> 060110 162140 retry 0: 28381
>> 060110 162140 retry 1: 4792
>> 060110 162140 status 1 (DB_unfetched): 23136
>> 060110 162140 status 2 (DB_fetched): 9234
>> 060110 162140 status 3 (DB_gone): 803
>> 060110 162140 CrawlDb statistics: done
>>
>> How come the DB_fetched is about 3x more than earlier, and the TOTAL
>> urls goes way beyond the 27939 from before?
>> It doesn't make any sense. I'd expect results similar to the run with
>> the other mapred settings.
>
> Please look at your fetcher errors. More, smaller fetch lists mean
> that each fetcher task has fewer unique hosts. I'd actually expect
> fewer pages to succeed, but only an analysis of your fetcher errors
> will fully explain this.
>
> Again, the reason that the total is higher is that it includes newly
> discovered urls.
>
> Doug
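
P.S. To dig into the fetcher errors further, I'll try grouping the
failures by reason with something along these lines (again a rough
sketch; I'm guessing at the exact wording of the fetcher's log
messages, so the pattern will likely need tweaking):

  # group failed fetches by error message to see which reason dominates
  grep -h failed logs/*tasktracker*.log \
    | sed 's/.*failed/failed/' | sort | uniq -c | sort -rn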
