Here are some general comments: the problem is in Hadoop itself, i.e. in the map-reduce processing. HADOOP-206 is still not solved. Have a look:
http://www.mail-archive.com/hadoop-user%40lucene.apache.org/msg00521.html

Well, again, it is wishful thinking to ask for many developers, patches, bug reports and bug fixes without focusing on what such developers need. Same example again: HADOOP-206 was reported and it is still not solved. So how do you expect to get more developers, when a developer has just one machine and it takes three days to perform any serious testing, fetching, indexing or development of any sort? Developers move on... When the focus of development is on the 1000-machine / large-install case, issues like 206 never get solved. So asking for more developers to provide bug fixes is wishful thinking.

Sorry, if I knew how to solve the map/reduce problem I would fix it and submit a patch, and I am sure I am not the only one here. The map/reduce stuff is not really a walk in the park :-).

The current direction of Nutch development is geared towards large installs, and it is great software. However, let's not pretend/preach that Nutch is good for small installs; Nutch left that life behind when it embraced map/reduce, i.e. starting from 0.8.

Regards,

On 11/13/06, Uroš Gruber <[EMAIL PROTECTED]> wrote:
Sami Siren wrote:
> carmmello wrote:
>> So, I think, one of the possibilities for the user of a single machine is
>> that the Nutch developers could use some of their time to improve the
>> previous 0.7.2, adding some new features to it in further releases of that
>> series. I don't believe that there are many Nutch users, in the real world
>> of searching, with a farm of computers. I, for myself, have already built
>> an index of more than one million pages on a single machine, with a
>> somewhat old Athlon 2.4+ and 1 gig of memory, using the 0.7.2 version,
>> with very good results, including the actual searching, and gave up the
>> same task using the 0.8 version because of the large amount of time
>> required, time that I did not have, to complete all the tasks after the
>> fetching of the pages.
>
> How fast do you need to go?
>
> I did a 1 million page crawl today with the trunk version of Nutch patched
> with NUTCH-395 [1]. Total time for fetching was a little over 7 hrs.
>
How is that even possible? I have a 3.2 GHz Pentium with 2 GB of RAM. I had
the same speed problem, and because of that I set up Nutch on a single node.
About an hour ago the fetcher finished crawling 1.2 million pages, but that
took 30 hours:

Map     2 total, 2 succeeded, 0 failed   12-Nov-2006 15:10:35 to 13-Nov-2006 05:22:16 (14hrs, 11mins, 41sec)
Reduce  2 total, 2 succeeded, 0 failed   12-Nov-2006 15:10:46 to 13-Nov-2006 21:59:19 (30hrs, 48mins, 33sec)

During the map job I got about 24 pages/s. I didn't test it with this patch.
But then the reduce job was slow as hell; I really don't understand what took
so long. It is almost twice as slow as the map job. I think we need to work
on that part. If I use local mode the numbers are even worse. I can't imagine
how long it would take to crawl, say, 10 million pages.

I would like to help make Nutch faster, but there are some parts I don't
quite understand yet. I need to work on that first.

regards

Uros

> But of course there are still various ways to optimize the fetching
> process - for example optimizing the scheduling of URLs to fetch, improving
> the Nutch agent to use the Accept header [2] for failing fast on content it
> cannot handle, etc.
>
> [1] http://issues.apache.org/jira/browse/NUTCH-395
> [2] http://www.mail-archive.com/[email protected]/msg04344.html
>
> --
> Sami Siren
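
To make the Accept-header idea in [2] concrete, here is a minimal, hypothetical Java sketch of "failing fast on content it cannot handle". This is not Nutch's actual protocol-http code; the accepted MIME-type list and the example URL are made-up assumptions for illustration only.

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class AcceptFastFail {
        // Hypothetical list of MIME types this crawler can parse.
        private static final String ACCEPTED =
            "text/html,text/plain,application/xhtml+xml";

        public static void main(String[] args) throws Exception {
            // Example URL, for illustration only.
            URL url = new URL("http://example.com/some/page");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            // Advertise what we can parse; a well-behaved server can then
            // answer 406 Not Acceptable instead of sending a large body.
            conn.setRequestProperty("Accept", ACCEPTED);

            int status = conn.getResponseCode();
            String contentType = conn.getContentType();
            String mime = contentType == null
                ? "" : contentType.split(";")[0].trim();

            // Fail fast: give up before reading the response body if the
            // server refused (406) or returned a type we cannot handle
            // (crude substring match, enough for a sketch).
            if (status == HttpURLConnection.HTTP_NOT_ACCEPTABLE
                    || (mime.length() > 0 && !ACCEPTED.contains(mime))) {
                System.out.println("Skipping " + url + " (" + status
                    + ", " + contentType + ")");
            } else {
                System.out.println("Would fetch and parse " + url
                    + " (" + contentType + ")");
            }
            conn.disconnect();
        }
    }

In a real fetcher this check would presumably sit in the protocol code, and the accepted types would come from configuration rather than a hard-coded constant, along the lines discussed in [2].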
