Hi Mike,

Your different tests are really interesting, thanks for sharing!
I didn't do as many tests. I changed the number of fetch threads and the
number of map and reduce tasks and noticed that it gave me quite
different results in terms of pages fetched.
Then, I wanted to see if this issue would still happen when running the
crawl (single pass) on one single machine running everything locally,
without ndfs.
So I injected 50000 urls and got only 2315 urls fetched.  For most of
the urls I couldn't find any trace in the logs.
I noticed that if I put a counter at the beginning of the
"while(true)" loop in the method run() in Fetcher.java, I don't
end up with 50000!
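
For illustration, here is a minimal sketch of that kind of counter. It is not the actual Fetcher code: the class name, the in-memory queue standing in for the fetch list, and the example.com urls are all assumptions made up for the example. The idea is just to count every pass through the loop with a thread-safe counter, so the total can be compared with the number of injected urls.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class FetchLoopCounter {

    // Counts loop iterations across several worker threads.
    // The queue is a hypothetical stand-in for the fetch list.
    static int countIterations(int urlCount, int threadCount) {
        Queue<String> urls = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < urlCount; i++) {
            urls.add("http://example.com/" + i);
        }

        AtomicInteger iterations = new AtomicInteger();
        Thread[] threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            threads[t] = new Thread(() -> {
                while (true) {
                    String url = urls.poll();
                    if (url == null) {
                        break; // no more input
                    }
                    // Counter placed at the top of the loop body,
                    // before any fetching would happen.
                    iterations.incrementAndGet();
                }
            });
            threads[t].start();
        }

        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return iterations.get();
    }

    public static void main(String[] args) {
        System.out.println(countIterations(50000, 8));
    }
}
```

When nothing inside the loop can kill a thread, the counter matches the number of input urls, which is the behaviour one would expect from the real loop as well.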
After some poking around, I noticed that if I comment out the line doing
the page fetch "/ProtocolOutput output = protocol.getProtocolOutput(key,
datum);/", then I get 50000.
There seems to be something really wrong with that.  It seems to mean
that some threads are dying without notification in the http protocol
code (if it makes any sense).
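
To make the suspicion concrete, here is a small sketch of how a worker thread can die quietly and take the rest of its input with it. This is only a simulation under stated assumptions: fakeFetch is a hypothetical stand-in that throws for some inputs, and nothing here is taken from the Nutch or httpclient code. When the exception escapes the loop, the thread ends; wrapping the call in a catch of Throwable keeps the loop alive, and an uncaught-exception handler at least records that a thread died.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class SilentThreadDeath {

    // Number of worker threads that ended because of an uncaught exception.
    static final AtomicInteger deadThreads = new AtomicInteger();

    // Hypothetical fetch: fails with an unchecked exception for some ids.
    static void fakeFetch(int id) {
        if (id % 10 == 0) {
            throw new IllegalStateException("simulated failure for " + id);
        }
    }

    // Returns how many items were taken from the queue before all
    // workers finished (or died).
    static int run(int urlCount, int threadCount, boolean guard) {
        Queue<Integer> queue = new ConcurrentLinkedQueue<>();
        for (int i = 1; i <= urlCount; i++) {
            queue.add(i);
        }

        AtomicInteger processed = new AtomicInteger();
        Thread[] threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            threads[t] = new Thread(() -> {
                while (true) {
                    Integer id = queue.poll();
                    if (id == null) {
                        break;
                    }
                    processed.incrementAndGet();
                    if (guard) {
                        try {
                            fakeFetch(id);
                        } catch (Throwable e) {
                            // A real fetcher would log this and move on.
                        }
                    } else {
                        fakeFetch(id); // exception escapes and ends the thread
                    }
                }
            });
            // Record thread deaths instead of letting them pass unnoticed.
            threads[t].setUncaughtExceptionHandler(
                (thread, error) -> deadThreads.incrementAndGet());
            threads[t].start();
        }

        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + run(1000, 4, false));
        System.out.println("guarded:   " + run(1000, 4, true));
        System.out.println("dead threads: " + deadThreads.get());
    }
}
```

In the unguarded run every worker eventually hits a failing item and stops, so far fewer items are processed than were queued; in the guarded run all of them are. Whether this is what actually happens inside protocol-httpclient is still only a guess.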
I then decided to switch to using the old http protocol plugin:
protocol-http (in nutch-default.xml) instead of protocol-httpclient.
With the old protocol I got 50000 as expected.

The following bug seems to be very similar to what we are encountering:
http://issues.apache.org/jira/browse/NUTCH-136
Check out the latest comment.  I'm gonna remove line 211 and run some
tests to see how it behaves (with protocol-http and protocol-httpclient).

I'll let you know what I find out,
--Florent

Mike Smith wrote:

>Hi Florent
>
>I did some more testing. Here are the results:
>
>I have 3 machines, each a P4 with 1G ram. All three are data nodes and one
>is also the namenode. I started from 80000 seed urls and tried to see the
>effect of a depth 1 crawl for different configurations.
>
>The number of unfetched pages changes with different configurations:
>
>--Configuration 1
>Number of map tasks: 3
>Number of reduce tasks: 3
>Number of fetch threads: 40
>Number of thread per host: 2
>http.timeout: 10 sec
>-------------------------------
>6700 pages fetched
>
>--Configuration 2
>Number of map tasks: 12
>Number of reduce tasks: 6
>Number of fetch threads: 500
>Number of thread per host: 20
>http.timeout: 10 sec
>-------------------------------
>18000 pages fetched
>
>--Configuration 3
>Number of map tasks: 40
>Number of reduce tasks: 20
>Number of fetch threads: 500
>Number of thread per host: 20
>http.timeout: 10 sec
>-------------------------------
>37000 pages fetched
>
>--Configuration 4
>Number of map tasks: 100
>Number of reduce tasks: 20
>Number of fetch threads: 100
>Number of thread per host: 20
>http.timeout: 10 sec
>-------------------------------
>34000 pages fetched
>
>
>--Configuration 5
>Number of map tasks: 50
>Number of reduce tasks: 50
>Number of fetch threads: 40
>Number of thread per host: 100
>http.timeout: 20 sec
>-------------------------------
>52000 pages fetched
>
>--Configuration 6
>Number of map tasks: 50
>Number of reduce tasks: 100
>Number of fetch threads: 40
>Number of thread per host: 100
>http.timeout: 20 sec
>-------------------------------
>57000 pages fetched
>
>--Configuration 7
>Number of map tasks: 50
>Number of reduce tasks: 120
>Number of fetch threads: 250
>Number of thread per host: 20
>http.timeout: 20 sec
>-------------------------------
>60000 pages fetched
>
>
>
>Do you have any idea why pages are missing from the fetcher without any
>log or exceptions? It seems it really depends on the number of reduce
>tasks!
>Thanks, Mike
>  
>
