Re: Strategic Direction of Nutch

Sami Siren Mon, 13 Nov 2006 10:28:50 -0800

carmmello wrote:

So, I think one of the possibilities for the user of a single machine is that the Nutch developers could use some of their time to improve the previous 0.7.2, adding some new features to it, with further releases of this series. I don't believe that there are many Nutch users, in the real world of searching, with a farm of computers. I, for myself, have already built an index of more than one million pages on a single machine, with a somewhat old Athlon 2.4+ and 1 gig of memory, using the 0.7.2 version, with very good results, including the actual searching, and gave up the same task using the 0.8 version because of the large amount of time required, time that I did not have, to complete all the tasks after the fetching of the pages.


How fast do you need to go?

I did a 1 million page crawl today with the trunk version of Nutch patched with NUTCH-395 [1]. Total time for fetching was a little over 7 hours, which works out to roughly 40 pages per second.

But of course there are still various ways to optimize the fetching process, for example optimizing the scheduling of URLs to fetch, or improving the Nutch agent to use the Accept header [2] so that it fails fast on content it cannot handle.
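To make the Accept header idea a bit more concrete, here is a minimal standalone sketch in plain Java. It is not how Nutch's protocol plugins are written; the class name, the method names, and the list of parseable MIME types are all assumptions made up for illustration. The idea is to tell the server which types the crawler can handle, and then to skip the body when the server answers 406 or reports a Content-Type there is no parser for.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch.
class AcceptHeaderSketch {

    // Assumption: the MIME types this crawler has parsers for.
    static final Set<String> PARSEABLE = new LinkedHashSet<>(
            Arrays.asList("text/html", "text/plain", "application/xhtml+xml"));

    // Builds the value sent in the Accept request header.
    static String buildAcceptHeader() {
        return String.join(", ", PARSEABLE);
    }

    // True if the Content-Type response header names a type we can parse.
    // Parameters such as "; charset=UTF-8" are ignored, and a missing
    // header is treated as "cannot handle" (a design choice, not a rule).
    static boolean isParseable(String contentType) {
        if (contentType == null) {
            return false;
        }
        int semi = contentType.indexOf(';');
        String mime = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return PARSEABLE.contains(mime);
    }

    // Decides from the status line and headers alone whether the body is
    // worth downloading. Only Accept-related rejection is modeled here.
    static boolean shouldReadBody(int status, String contentType) {
        if (status == HttpURLConnection.HTTP_NOT_ACCEPTABLE) {
            return false; // server honored Accept and has nothing we want
        }
        return isParseable(contentType);
    }

    // Sends the request with an Accept header and inspects only the
    // response status and Content-Type; the body is never read here.
    static boolean probe(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", buildAcceptHeader());
        try {
            return shouldReadBody(conn.getResponseCode(), conn.getContentType());
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        System.out.println("Accept: " + buildAcceptHeader());
        System.out.println(shouldReadBody(200, "text/html; charset=UTF-8"));
        System.out.println(shouldReadBody(200, "application/pdf"));
        System.out.println(shouldReadBody(406, null));
    }
}
```

Many servers ignore the Accept header entirely, so the Content-Type check on the response is what does most of the work in practice; the 406 branch only helps with servers that do content negotiation.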


[1] http://issues.apache.org/jira/browse/NUTCH-395
[2] http://www.mail-archive.com/[email protected]/msg04344.html

--
 Sami Siren
