Sami Siren wrote:
Uroš Gruber wrote:
Andrzej Bialecki wrote:
Sami Siren (JIRA) wrote:
I am not sure what you are referring to with this 3-4 sec, but yes, I
agree there are more aspects to optimize in the fetcher. What I was
first concerned about was the fetching I/O speed, which was getting
ridiculously low (not quite sure when this happened).
I set DEBUG level logging and checked the timing during operations.
When doing the MapReduce job, which is run after every page, it takes
3-4 seconds until the next URL is fetched.
I have a local site, and fetching 100 pages takes about 6 minutes.
Even I haven't seen it go that slow :)
Lucky me ;)
Depending on the number of map/reduce tasks, there is a framework
overhead to transfer the job JAR
I would like to help find what causes such slowness. Version 0.7 did
not use MapReduce, and fetching ran at about 20 pages per second on
the same server. With the same site, fetching is now down to 0.3 pages
per second.
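
As a rough sanity check on the figures quoted above (100 pages in about
6 minutes, versus about 20 pages per second on 0.7), the rates can be
worked out as below. This is only a sketch: the helper name
pages_per_second is made up for illustration and is not part of Nutch.

```python
def pages_per_second(pages: int, seconds: float) -> float:
    """Average fetch rate over a run (hypothetical helper, not a Nutch API)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return pages / seconds


# Figures quoted in this thread
observed = pages_per_second(100, 6 * 60)  # 100 pages in about 6 minutes
baseline = 20.0                           # pages/s reported for version 0.7
slowdown = baseline / observed            # how many times slower than 0.7

print(f"observed rate:    {observed:.2f} pages/s")
print(f"seconds per page: {1 / observed:.1f}")
print(f"slowdown vs 0.7:  {slowdown:.0f}x")
```

The seconds-per-page figure this gives is in line with the 3-4 second
gap between URLs seen in the DEBUG log, which suggests the per-page
overhead accounts for nearly all of the run time.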
With the queue-based solution I just did a crawl of about 600k pages
and it averaged 16 pps (1376 kb/s) with parsing enabled. Perhaps you
could try Andrzej's new Fetcher and see how it performs for you (I
haven't yet read the code or tested it myself).
I'll try it, but first I need to test it on Java 1.4.2. Maybe the
problem is with the OS itself. I'll report back as soon as I have more
test results.
regards
Uros