Sami Siren wrote:
carmmello wrote:
So, I think, one of the possibilities for the user of a single
machine is that the Nutch developers could use some of their time to
improve the previous 0.7.2, adding some new features to it, with
further releases of this series. I don't believe that there are many
Nutch users, in the real world of searching, with a farm of
computers. I, for myself, have already built an index of more than
one million pages on a single machine, with a somewhat old Athlon
2400+ and 1 GB of memory, using the 0.7.2 version, with very good
results, including the actual searching. I gave up the same task
with the 0.8 version because of the large amount of time required,
time that I did not have, to complete all the tasks after the
fetching of the pages.
How fast do you need to go?
I did a 1 million page crawl today with the trunk version of Nutch
patched with NUTCH-395 [1]. Total time for fetching was a little over 7 hrs.
How is that even possible?
I have a 3.2 GHz Pentium with 2 GB of RAM. I had the same speed problem,
which is why I set up Nutch on a single node. About an hour ago the
fetcher finished crawling 1.2 million pages. But this took 30 hours:
        Total  Successful  Failed  Start                 Finish                Duration
Map     2      2           0       12-Nov-2006 15:10:35  13-Nov-2006 05:22:16  14hrs, 11mins, 41sec
Reduce  2      2           0       12-Nov-2006 15:10:46  13-Nov-2006 21:59:19  30hrs, 48mins, 33sec
During the map job I got about 24 pages/s (I didn't test it with this
patch). But the reduce job was slow as hell; I really don't understand
what took so long. It is almost twice as slow as the map job.
I think we need to work on that part.
If I use local mode the numbers are even worse.
I can't imagine how long it would take to crawl, say, 10 million pages.
I would like to help make Nutch faster, but there are some parts I don't
quite understand. I need to work on those first.
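As a quick sanity check on the numbers above, the reported durations can be turned into average throughput like this (a simple arithmetic sketch using the task-history times; the figures for pages and durations come from the report above):

```python
# Sanity-check the throughput figures reported above.
# Durations from the Hadoop task history: map phase 14h 11m 41s,
# end-to-end (through reduce) 30h 48m 33s, for 1.2 million pages.

def pages_per_second(pages, hours, minutes, seconds):
    """Average throughput over the given wall-clock duration."""
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return pages / total_seconds

map_rate = pages_per_second(1_200_000, 14, 11, 41)
overall_rate = pages_per_second(1_200_000, 30, 48, 33)

print(f"map phase:  {map_rate:.1f} pages/s")   # ~23.5, matching "about 24 pages/s"
print(f"end to end: {overall_rate:.1f} pages/s")
```

The map-phase rate matches the reported "about 24 pages/s"; the end-to-end rate is less than half of that, which is the reduce-phase slowdown being discussed.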
regards
Uros
But of course there are still various ways to optimize the fetching
process - for example, optimizing the scheduling of URLs to fetch,
improving the Nutch agent to use the Accept header [2] to fail fast on
content it cannot handle, etc.
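The Accept-header idea from [2] could look roughly like this (a hypothetical sketch, not Nutch's actual fetcher code; the function name and type list are made up for illustration):

```python
# Sketch of the fail-fast idea: advertise what the crawler can parse
# via the Accept request header, and drop a response early when its
# Content-Type is not among the acceptable types. Illustrative only;
# not actual Nutch code.

ACCEPTED_TYPES = ("text/html", "text/plain", "application/xhtml+xml")

# The header the fetcher would send with each request:
ACCEPT_HEADER = ", ".join(ACCEPTED_TYPES) + ";q=0.9, */*;q=0.1"

def should_abort(content_type_header):
    """Return True if the response should be dropped before the body
    is downloaded, based on its Content-Type header."""
    if not content_type_header:
        return False  # no header: give the parser a chance
    # Strip parameters such as "; charset=utf-8".
    media_type = content_type_header.split(";")[0].strip().lower()
    return media_type not in ACCEPTED_TYPES

print(should_abort("application/pdf"))           # True: skip it
print(should_abort("text/html; charset=utf-8"))  # False: fetch it
```

A well-behaved server may also honor the Accept header itself and return 406 Not Acceptable, saving the transfer entirely; the client-side check covers servers that ignore it.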
[1]http://issues.apache.org/jira/browse/NUTCH-395
[2]http://www.mail-archive.com/[email protected]/msg04344.html
--
Sami Siren