Andrzej Bialecki wrote:
Sami Siren (JIRA) wrote:
I am not sure what you are referring to with this 3-4 sec, but yes, I agree
there are more aspects of the fetcher to optimize. What I was concerned
about first was the fetching IO speed, which was getting ridiculously low
(not quite sure when this happened).
I set logging to DEBUG level and checked the timing during operations.
When the MapReduce job that runs after every page is executing, it takes
3-4 seconds until the next url is fetched.
I have a local site, and fetching 100 pages takes about 6 minutes.
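As a quick sanity check on these numbers (a sketch only; the class and method names are made up for illustration): 100 pages in about 6 minutes works out to roughly 3.6 seconds per page, which matches the 3-4 second gap seen in the DEBUG log.

```java
import java.util.Locale;

public class FetchRate {
    /** Average wall-clock seconds spent per fetched page. */
    public static double secondsPerPage(int pages, double totalSeconds) {
        return totalSeconds / pages;
    }

    /** Average throughput in pages per second. */
    public static double pagesPerSecond(int pages, double totalSeconds) {
        return pages / totalSeconds;
    }

    public static void main(String[] args) {
        int pages = 100;          // pages fetched from the local site
        double seconds = 6 * 60;  // "about 6 minutes"
        System.out.printf(Locale.ROOT, "%.1f s/page, %.2f pages/s%n",
                secondsPerPage(pages, seconds), pagesPerSecond(pages, seconds));
    }
}
```

This comes to about 0.28 pages/s, in line with the 0.3 pages/s the fetcher itself reports.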
Depending on the number of map/reduce tasks, there is a framework
overhead to transfer the job JAR file, and start the subprocess on
each tasktracker. However, once these are started the framework's
overhead should be negligible, because a single task is responsible for
fetching many urls.
Naturally, for small jobs, with very few urls, the overhead is
relatively large.
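To illustrate the amortization (a rough sketch with made-up numbers; the 5-second startup cost and the 0.05-second per-url fetch time are assumptions, not measurements): a fixed per-task startup cost is paid once and spread over all urls the task fetches, so its share per url shrinks as the fetchlist grows.

```java
public class TaskOverhead {
    /**
     * Effective seconds per url when a fixed startup cost is paid once
     * per task and then spread over every url that task fetches.
     */
    public static double effectiveSecondsPerUrl(double startupSeconds,
                                                double fetchSecondsPerUrl,
                                                int urls) {
        return startupSeconds / urls + fetchSecondsPerUrl;
    }

    public static void main(String[] args) {
        double startup = 5.0;  // hypothetical: JAR transfer + subprocess start
        double perUrl = 0.05;  // hypothetical: 20 pages/s once running
        for (int urls : new int[] {1, 100, 10_000}) {
            System.out.println(urls + " urls: "
                    + effectiveSecondsPerUrl(startup, perUrl, urls) + " s/url");
        }
    }
}
```

For a single url the startup cost dominates; for thousands of urls it nearly vanishes.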
The symptom I'm seeing is that eventually most threads end up in
blockAddr spin-waiting. Another problem I see is that when the number
of fetching threads is high relative to the available bandwidth, the
data is trickling in so slowly that the Fetcher.run() decides that
it's hung, and aborts the task. What happens then is that the task
gets a SUCCEEDED status in tasktracker, although in reality it may
have fetched only a small portion of the allotted fetchlist.
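A minimal sketch of how such a false "hung" verdict can arise, assuming the check is a simple no-progress timeout (this is an illustration, not Nutch's actual Fetcher code; ProgressWatchdog and its methods are hypothetical names). If progress is only recorded when a whole page completes, slow threads sharing little bandwidth can all stay silent past the timeout even though bytes are still arriving.

```java
import java.util.concurrent.atomic.AtomicLong;

public class ProgressWatchdog {
    private final long timeoutMillis;
    private final AtomicLong lastProgressMillis;

    public ProgressWatchdog(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastProgressMillis = new AtomicLong(nowMillis);
    }

    /** Called by fetcher threads whenever they report progress. */
    public void recordProgress(long nowMillis) {
        lastProgressMillis.set(nowMillis);
    }

    /** True when nothing has been recorded for longer than the timeout. */
    public boolean looksHung(long nowMillis) {
        return nowMillis - lastProgressMillis.get() > timeoutMillis;
    }

    public static void main(String[] args) {
        ProgressWatchdog w = new ProgressWatchdog(10_000, 0);
        // A page takes 15 s to trickle in; progress is recorded only at the end.
        System.out.println(w.looksHung(12_000)); // still downloading, yet flagged
        w.recordProgress(15_000);
        System.out.println(w.looksHung(16_000)); // recent progress, not flagged
    }
}
```

Recording progress per received chunk rather than per finished page would be one way to tell a slow fetch from a stalled one; whether that fits the real Fetcher is an open question.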
I would like to help find what causes such slowness. Version 0.7 did not
use MapReduce, and fetching ran at about 20 pages per second on the
same server. With the same site, fetching is now down to 0.3 pages per second.
Here is the log output:
2006-08-02 10:12:29,162 INFO mapred.LocalJobRunner - 37 pages, 0 errors, 0.3 pages/s, 52 kb/s,
2006-08-02 10:12:30,164 INFO mapred.LocalJobRunner - 37 pages, 0 errors, 0.3 pages/s, 52 kb/s,
2006-08-02 10:12:31,166 INFO mapred.LocalJobRunner - 37 pages, 0 errors, 0.3 pages/s, 51 kb/s,
2006-08-02 10:12:32,168 INFO mapred.LocalJobRunner - 37 pages, 0 errors, 0.3 pages/s, 51 kb/s,
2006-08-02 10:12:33,170 INFO mapred.LocalJobRunner - 37 pages, 0 errors, 0.3 pages/s, 50 kb/s,
We should open more than one ticket to track these separate aspects.
And for general discussion the mailing lists are perhaps the best place.
(I'm moving this to the list then).
regards
Uros