Here are some general comments: the problem is in Hadoop, i.e. MapReduce, i.e. processing. HADOOP-206 is not solved. Have a look:
http://www.mail-archive.com/hadoop-user%40lucene.apache.org/msg00521.html

Well, again, it is wishful thinking to ask for many developers, patches, and bug reports and fixes without focusing on the needs of such developers. Same example again: HADOOP-206 was reported and it is still not solved. So how do you expect to get more developers, when a developer with just one machine needs three days to perform any serious testing/fetching/indexing or any sort of development? Developers move on... When the focus of development is on solving for 1000-machine/large installs, issues like HADOOP-206 never get solved. Thus asking for more developers to provide bug fixes is wishful thinking. Sorry, if I knew how to solve the map/reduce problem I would fix it and submit a patch, and I am sure I am not the only one here. The map/reduce stuff is not really a walk in the park :-).

The current direction of Nutch development is geared towards large installs, and it is great software. However, let's not pretend/preach that Nutch is good for small installs; Nutch left that life when it embraced MapReduce, i.e. starting from 0.8.

Regards,

On 11/13/06, Uroš Gruber <[EMAIL PROTECTED]> wrote:
> Sami Siren wrote:
> > carmmello wrote:
> >> So, I think, one of the possibilities for the user of a single
> >> machine is that the Nutch developers could use some of their time to
> >> improve the previous 0.7.2, adding to it some new features, with
> >> further releases of this series. I don't believe that there are many
> >> Nutch users, in the real world of searching, with a farm of
> >> computers.
> >> I, for myself, have already built an index of more than
> >> one million pages on a single machine, with a somewhat old Athlon
> >> 2.4+ and 1 GB of memory, using the 0.7.2 version, with very good
> >> results, including the actual searching, and gave up the same task
> >> using the 0.8 version because of the large amount of time required,
> >> time that I did not have, to complete all the tasks after the
> >> fetching of the pages.
> >
> > How fast do you need to go?
> >
> > I did a 1 million page crawl today with the trunk version of Nutch
> > patched with NUTCH-395 [1]. Total time for fetching was a little over
> > 7 hrs.
> >
> How is that even possible?
>
> I have a 3.2 GHz Pentium with 2 GB RAM. I had the same speed problem, so
> I set up Nutch with a single node. About an hour ago the fetcher finished
> crawling 1.2 million pages, but this took 30 hours:
>
> Map:    2 tasks, 2 succeeded, 0 failed, 12-Nov-2006 15:10:35 to 13-Nov-2006 05:22:16 (14hrs, 11mins, 41sec)
> Reduce: 2 tasks, 2 succeeded, 0 failed, 12-Nov-2006 15:10:46 to 13-Nov-2006 21:59:19 (30hrs, 48mins, 33sec)
>
> During the map job I got about 24 pages/s. I didn't test it with this
> patch, but the reduce job was slow as hell. I really don't understand
> what took so long. It is almost twice as slow as the map job.
> I think we need to work on that part.
>
> If I use local mode the numbers are even worse.
>
> I can't imagine how long it would take to crawl, let's say, 10 million
> pages.
>
> I would like to help make Nutch faster, but there are some parts I don't
> quite understand. I need to work on those first.
>
> regards
>
> Uros
>
> > But of course there are still various ways to optimize the fetching
> > process - for example optimizing the scheduling of URLs to fetch,
> > improving the Nutch agent to use the Accept header [2] for failing
> > fast on content it cannot handle, etc.
> >
> > [1] http://issues.apache.org/jira/browse/NUTCH-395
> > [2] http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg04344.html
> >
> > --
> > Sami Siren

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
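[Editor's note] The Accept-header idea Sami mentions can be sketched as follows. This is a hypothetical illustration, not actual Nutch code: the crawler advertises the MIME types it can parse via the HTTP `Accept` header, and aborts the fetch before downloading the body when the server returns an unparseable type (or `406 Not Acceptable`). The class and method names are invented for this sketch.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class AcceptHeaderFetch {
    // Content types this hypothetical fetcher can parse.
    static final String ACCEPTED = "text/html,text/plain,application/xhtml+xml";

    // Returns true only if the response is a type we can handle;
    // otherwise fails fast without reading the response body.
    public static boolean fetchIfParseable(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", ACCEPTED);
        int status = conn.getResponseCode();
        if (status == 406) {            // server honoured Accept and refused
            return false;
        }
        String type = conn.getContentType();
        if (type == null || !isAccepted(type)) {
            conn.disconnect();          // fail fast: skip the body entirely
            return false;
        }
        // ... read conn.getInputStream() and hand the content to a parser ...
        return true;
    }

    // Compare the response's media type (ignoring charset parameters)
    // against the accepted list.
    static boolean isAccepted(String contentType) {
        String mime = contentType.split(";")[0].trim();
        for (String ok : ACCEPTED.split(",")) {
            if (ok.equals(mime)) return true;
        }
        return false;
    }
}
```

Note that many servers ignore `Accept` and never return 406, so the Content-Type check on the response is the part that actually saves bandwidth in practice.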