Dear Sami Siren,
Thank you for your prompt answer, but my problem with 0.8.1 was not with the
fetching time itself, which is on par with 0.7.2 (although your fetching
speed is a lot greater than mine). My problem is with the time taken by all
the post-fetching processes, which is much longer than with 0.7.2. When I
indexed that million pages with 0.7.2, the whole process took about a
weekend; when I tried to index 500,000 pages with 0.8.1, the fetching
went OK, but after that I could not get the job done. The weekend went by
and I just could not wait any longer. That's why I think that, in many cases,
when using a single machine, 0.7.2 could be a better choice, especially if
that version is updated.
Regards
----- Original Message -----
From: "Sami Siren" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Monday, November 13, 2006 4:28 PM
Subject: Re: Strategic Direction of Nutch
carmmello wrote:
So, I think, one possibility for the user of a single machine is
that the Nutch developers could use some of their time to improve the
previous 0.7.2, adding some new features to it, with further releases in
this series. I don't believe that there are many Nutch users, in the real
world of searching, with a farm of computers. I, for myself, have
already built an index of more than one million pages on a single
machine, a somewhat old Athlon 2.4+ with 1 GB of memory, using the
0.7.2 version, with very good results, including the actual searching,
and gave up on the same task using the 0.8 version because of the large
amount of time required, time that I did not have, to complete all the
tasks after the fetching of the pages.
How fast do you need to go?
I did a 1 million page crawl today with the trunk version of Nutch patched
with NUTCH-395 [1]. Total time for fetching was a little over 7 hours.
But of course there are still various ways to optimize the fetching process,
for example optimizing the scheduling of URLs to fetch, or improving the
Nutch agent to use the Accept header [2] so that it fails fast on content it
cannot handle.
[1]http://issues.apache.org/jira/browse/NUTCH-395
[2]http://www.mail-archive.com/[email protected]/msg04344.html
--
Sami Siren