Hi Florent, I did some more testing. Here are the results:
I have 3 machines (P4, 1 GB RAM each). All three are datanodes and one
of them is also the namenode. I started from 80000 seed URLs and ran a
depth-1 crawl under different configurations. The number of unfetched
pages changes with the configuration (the nutch-site.xml properties I'm
varying are sketched at the end of this message, after the quoted
thread):

--Configuration 1
Number of map tasks: 3
Number of reduce tasks: 3
Number of fetch threads: 40
Number of threads per host: 2
http.timeout: 10 sec
-------------------------------
6700 pages fetched

--Configuration 2
Number of map tasks: 12
Number of reduce tasks: 6
Number of fetch threads: 500
Number of threads per host: 20
http.timeout: 10 sec
-------------------------------
18000 pages fetched

--Configuration 3
Number of map tasks: 40
Number of reduce tasks: 20
Number of fetch threads: 500
Number of threads per host: 20
http.timeout: 10 sec
-------------------------------
37000 pages fetched

--Configuration 4
Number of map tasks: 100
Number of reduce tasks: 20
Number of fetch threads: 100
Number of threads per host: 20
http.timeout: 10 sec
-------------------------------
34000 pages fetched

--Configuration 5
Number of map tasks: 50
Number of reduce tasks: 50
Number of fetch threads: 40
Number of threads per host: 100
http.timeout: 20 sec
-------------------------------
52000 pages fetched

--Configuration 6
Number of map tasks: 50
Number of reduce tasks: 100
Number of fetch threads: 40
Number of threads per host: 100
http.timeout: 20 sec
-------------------------------
57000 pages fetched

--Configuration 7
Number of map tasks: 50
Number of reduce tasks: 120
Number of fetch threads: 250
Number of threads per host: 20
http.timeout: 20 sec
-------------------------------
60000 pages fetched

Do you have any idea why pages go missing in the fetcher without any
log entry or exception? It really seems to depend on the number of
reduce tasks! (The checks I plan to run to trace where the URLs go are
also sketched at the bottom of this message.)

Thanks,
Mike

On 1/17/06, Mike Smith <[EMAIL PROTECTED]> wrote:
>
> I've experienced the same effect. When I decrease the number of
> map/reduce tasks I can fetch more web pages, but increasing them
> increases the number of unfetched pages. I also get some
> "java.net.SocketTimeoutException: Read timed out" exceptions in my
> datanode log files, but those timeouts couldn't account for this many
> missing pages! I agree the problem should be somewhere in the fetcher.
>
> Mike
>
>
> On 1/17/06, Florent Gluck <[EMAIL PROTECTED]> wrote:
> >
> > I'm having the exact same problem.
> > I noticed that changing the number of map/reduce tasks gives me
> > different DB_fetched results.
> > Looking at the logs, a lot of urls are actually missing. I can't find
> > their trace *anywhere* in the logs (whether on the slaves or the
> > master). I'm puzzled. Currently I'm trying to debug the code to see
> > what's going on.
> > So far, I noticed the generator is fine, so the issue must lie
> > further down the pipeline (fetcher?).
> >
> > Let me know if you find anything regarding this issue. Thanks.
> >
> > --Flo
> >
> > Mike Smith wrote:
> >
> > >Hi,
> > >
> > >I have set up four boxes using MapReduce and everything goes
> > >smoothly. I have fed in about 80000 seed URLs to begin with and
> > >crawled to depth 2. Only 1900 pages (about 300 MB of data) were
> > >fetched and the rest is marked as db_unfetched.
> > >Does anyone know what could be wrong?
> > >
> > >This is the output of (bin/nutch readdb h2/crawldb -stats):
> > >
> > >060115 171625 Statistics for CrawlDb: h2/crawldb
> > >060115 171625 TOTAL urls: 99403
> > >060115 171625 avg score: 1.01
> > >060115 171625 max score: 7.382
> > >060115 171625 min score: 1.0
> > >060115 171625 retry 0: 99403
> > >060115 171625 status 1 (DB_unfetched): 97470
> > >060115 171625 status 2 (DB_fetched): 1933
> > >060115 171625 CrawlDb statistics: done
> > >
> > >Thanks,
> > >Mike
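P.S. For reference, here is roughly how the settings above map onto
config properties, using configuration 7 as the example. I'm assuming
the standard property names from nutch-default.xml in the mapred branch
(and that http.timeout is given in milliseconds there), so double-check
against your own checkout. My overrides in conf/nutch-site.xml look
roughly like this:

  <property>
    <name>mapred.map.tasks</name>
    <value>50</value>        <!-- "Number of map tasks" -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>120</value>       <!-- "Number of reduce tasks" -->
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>250</value>       <!-- "Number of fetch threads" -->
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>20</value>        <!-- "Number of threads per host" -->
  </property>
  <property>
    <name>http.timeout</name>
    <value>20000</value>     <!-- 20 sec, expressed in milliseconds -->
  </property>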

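To try to narrow down where the URLs disappear, this is roughly the
check I plan to run next. It assumes the readdb tool in this build
supports -dump and that the fetcher logs one "fetching <url>" line per
URL it picks up, so adjust if your version behaves differently (the
paths are from my setup):

  # dump the crawldb to plain text
  bin/nutch readdb h2/crawldb -dump h2/crawldb_dump

  # after copying h2/crawldb_dump out of NDFS to the local disk,
  # count URLs by status
  cat crawldb_dump/part-* | grep -c "DB_fetched"
  cat crawldb_dump/part-* | grep -c "DB_unfetched"

  # on each slave, count how many URLs the fetcher actually attempted;
  # point this at wherever your tasktracker logs live
  grep -h "fetching http" logs/*.log | sort -u | wc -l

Comparing that last count against the size of the generated fetchlist
should at least show whether the URLs are lost before the fetcher ever
sees them or later, in the reduce/updatedb step.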