Sami Siren wrote:
> carmmello wrote:
>> So, I think, one of the possibilities for the user of a single
>> machine is that the Nutch developers could use some of their time to
>> improve the previous 0.7.2, adding some new features to it, with
>> further releases of this series. I don't believe that there are many
>> Nutch users, in the real world of searching, with a farm of
>> computers. I myself have already built an index of more than one
>> million pages on a single machine, with a somewhat old Athlon 2.4+
>> and 1 gig of memory, using the 0.7.2 version, with very good
>> results, including the actual searching. I gave up on the same task
>> with the 0.8 version because of the large amount of time, time that
>> I did not have, required to complete all the remaining steps after
>> the fetching of the pages.
>
> How fast do you need to go?
>
> I did a 1 million page crawl today with the trunk version of Nutch
> patched with NUTCH-395 [1]. Total time for fetching was a little over
> 7 hrs.

How is that even possible?
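Taking the quoted figures at face value, that works out to roughly

\[
\frac{1{,}000{,}000\ \text{pages}}{7 \times 3600\ \text{s}} \approx 40\ \text{pages/s},
\]

which can be compared with the per-second rate I report below.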
I have a 3.2 GHz Pentium with 2 GB of RAM. I had the same speed
problem; because of that I set up Nutch on a single node. About an
hour ago the fetcher finished crawling 1.2 million pages, but this
took 30 hours (job tracker history for job_0030):

           Total  Successful  Failed  Started               Finished              Duration
  Map      2      2           0       12-Nov-2006 15:10:35  13-Nov-2006 05:22:16  14hrs, 11mins, 41sec
  Reduce   2      2           0       12-Nov-2006 15:10:46  13-Nov-2006 21:59:19  30hrs, 48mins, 33sec

During the map job I got about 24 pages/s (I didn't test it with this
patch). But then the reduce job was slow as hell; I really don't
understand what took so long. It is almost twice as slow as the map
job, and I think we need to work on that part. If I use local mode the
numbers are even worse. I can't imagine how long it would take to
crawl, say, 10 million pages. I would like to help make Nutch faster,
but there are parts I don't quite understand yet, so I need to work on
that first.

regards
Uros

> But of course there are still various ways to optimize the fetching
> process - for example optimizing the scheduling of URLs to fetch,
> improving the Nutch agent to use the Accept header [2] to fail fast
> on content it cannot handle, etc.
>
> [1] http://issues.apache.org/jira/browse/NUTCH-395
> [2] http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg04344.html
>
> --
> Sami Siren
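To make the Accept-header idea concrete, here is a minimal sketch of the
fail-fast check, written against plain java.net rather than Nutch's
protocol plugins; the class name, accepted types and timeouts are just
placeholders for illustration:

import java.net.HttpURLConnection;
import java.net.URL;

// Minimal sketch of the fail-fast idea quoted above, using plain java.net
// rather than Nutch's protocol plugins. Accepted types and timeouts are
// placeholders.
public class AcceptHeaderFetch {

  // Content types this hypothetical fetcher can actually parse.
  private static final String ACCEPT =
      "text/html, text/plain, application/xhtml+xml;q=0.9, */*;q=0.1";

  // Returns true if the caller should go on and read the body.
  public static boolean worthFetching(String pageUrl) throws Exception {
    HttpURLConnection conn =
        (HttpURLConnection) new URL(pageUrl).openConnection();
    conn.setRequestProperty("Accept", ACCEPT);
    conn.setConnectTimeout(10000);
    conn.setReadTimeout(10000);

    int status = conn.getResponseCode();
    String type = conn.getContentType();  // e.g. "text/html; charset=UTF-8"

    // A well-behaved server may answer 406 Not Acceptable; otherwise we
    // still bail out before reading the body if the type is unusable.
    if (status == HttpURLConnection.HTTP_NOT_ACCEPTABLE
        || type == null
        || !(type.startsWith("text/") || type.contains("xml"))) {
      conn.disconnect();  // skip the body entirely
      return false;
    }
    return true;  // caller would now read conn.getInputStream()
  }
}

Even when a server ignores the header, checking the Content-Type before
reading the body avoids downloading and parsing content that would only
be thrown away.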
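On the reduce times in the table above: with the old mapred API the
number of map and reduce tasks per job is just a JobConf setting (the
mapred.map.tasks / mapred.reduce.tasks properties in hadoop-site.xml or
nutch-site.xml), so one cheap experiment is to split the reduce work
into more, smaller tasks. This is a sketch only, not Nutch's actual job
setup, and the values are placeholders:

import org.apache.hadoop.mapred.JobConf;

// Illustration only: the Hadoop 0.x JobConf knobs for task parallelism,
// equivalent to setting mapred.map.tasks / mapred.reduce.tasks in the
// config files. The numbers are placeholders, not tuning advice.
public class ParallelismSketch {
  public static JobConf tune(JobConf job) {
    job.setNumMapTasks(4);     // more, smaller map tasks
    job.setNumReduceTasks(4);  // spread the sort/merge over more reducers
    return job;
  }
}

Whether this actually helps on a single box depends on the task
tracker's slot count and on disk throughput, so treat it as an
experiment rather than a fix.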