Hi Mike,

Your different tests are really interesting, thanks for sharing! I didn't do as many tests. I changed the number of fetch threads and the number of map and reduce tasks, and noticed that it gave me quite different results in terms of pages fetched.

Then I wanted to see whether this issue would still happen when running the crawl (single pass) on a single machine, running everything locally, without ndfs. So I injected 50000 urls and got 2315 urls fetched. I couldn't find a trace in the logs of most of the urls.

I noticed that if I put a counter at the beginning of the "while(true)" loop in the run method in Fetcher.java, I don't end up with 50000! After some poking around, I noticed that if I comment out the line doing the page fetch, "ProtocolOutput output = protocol.getProtocolOutput(key, datum);", then I get 50000. There seems to be something really wrong with that. It seems to mean that some threads are dying without notification in the http protocol code (if that makes any sense).

I then decided to switch to the old http protocol plugin, protocol-http (in nutch-default.xml), instead of protocol-httpclient. With the old protocol I got 50000 as expected.
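
To illustrate the kind of failure I suspect, here is a minimal sketch, not the actual Fetcher code. It assumes worker threads pull entries from a shared queue and that the fetch call sometimes throws an unchecked exception that nobody catches or logs. The class SilentThreadDeath, the methods fakeFetch and countEntered, and the "fails on every 10th entry" rule are all made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

class SilentThreadDeath {

    // Hypothetical stand-in for the protocol call: fails on every 10th entry.
    static void fakeFetch(int entry) {
        if (entry % 10 == 0) {
            throw new IllegalStateException("simulated protocol failure on entry " + entry);
        }
    }

    // Returns how many entries reached the top of the worker loop.
    static int countEntered(int entries, int threads, boolean guard) {
        Queue<Integer> queue = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < entries; i++) {
            queue.add(i);
        }
        AtomicInteger entered = new AtomicInteger();

        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                while (true) {
                    Integer entry = queue.poll();
                    if (entry == null) {
                        return; // queue drained, normal exit
                    }
                    entered.incrementAndGet(); // the counter at the top of the loop
                    if (guard) {
                        try {
                            fakeFetch(entry);
                        } catch (Throwable e) {
                            // A real fetcher would log this; the thread stays alive.
                        }
                    } else {
                        fakeFetch(entry); // an exception here ends the thread
                    }
                }
            });
            // Assumption: swallow the failure to mimic a thread dying with no log line.
            worker.setUncaughtExceptionHandler((thread, error) -> { });
            workers.add(worker);
            worker.start();
        }

        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return entered.get();
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + countEntered(1000, 4, false));
        System.out.println("guarded:   " + countEntered(1000, 4, true));
    }
}
```

With 1000 entries and 4 threads, the unguarded run should stop at 31: each of the entries 0, 10, 20 and 30 kills one thread, and after that nobody is left to read the queue. The guarded run reaches all 1000. That is the same pattern as a counter that never gets to 50000, although it does not prove that this is what happens inside protocol-httpclient.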

The following bug seems to be very similar to what we are encountering:
http://issues.apache.org/jira/browse/NUTCH-136
Check out the latest comment. I'm gonna remove line 211 and run some tests to see how it behaves (with protocol-http and protocol-httpclient).

I'll let you know what I find out,
--Florent

Mike Smith wrote:

>Hi Florent
>
>I did some more testing. Here are the results:
>
>I have 3 machines, P4 and 1G ram. All three are data nodes and one is
>the namenode. I started from 80000 seed urls and tried to see the effect
>of a depth 1 crawl for different configurations.
>
>The number of unfetched pages changes with different configurations:
>
>--Configuration 1
>Number of map tasks: 3
>Number of reduce tasks: 3
>Number of fetch threads: 40
>Number of threads per host: 2
>http.timeout: 10 sec
>-------------------------------
>6700 pages fetched
>
>--Configuration 2
>Number of map tasks: 12
>Number of reduce tasks: 6
>Number of fetch threads: 500
>Number of threads per host: 20
>http.timeout: 10 sec
>-------------------------------
>18000 pages fetched
>
>--Configuration 3
>Number of map tasks: 40
>Number of reduce tasks: 20
>Number of fetch threads: 500
>Number of threads per host: 20
>http.timeout: 10 sec
>-------------------------------
>37000 pages fetched
>
>--Configuration 4
>Number of map tasks: 100
>Number of reduce tasks: 20
>Number of fetch threads: 100
>Number of threads per host: 20
>http.timeout: 10 sec
>-------------------------------
>34000 pages fetched
>
>--Configuration 5
>Number of map tasks: 50
>Number of reduce tasks: 50
>Number of fetch threads: 40
>Number of threads per host: 100
>http.timeout: 20 sec
>-------------------------------
>52000 pages fetched
>
>--Configuration 6
>Number of map tasks: 50
>Number of reduce tasks: 100
>Number of fetch threads: 40
>Number of threads per host: 100
>http.timeout: 20 sec
>-------------------------------
>57000 pages fetched
>
>--Configuration 7
>Number of map tasks: 50
>Number of reduce tasks: 120
>Number of fetch threads: 250
>Number of threads per host: 20
>http.timeout: 20 sec
>-------------------------------
>60000 pages fetched
>
>Do you have any idea why pages are missing from the fetcher without any
>log or exceptions? It seems it really depends on the number of reduce
>tasks!
>
>Thanks, Mike
