Re: [Nutch-general] Strategic Direction of Nutch

Andrzej Bialecki Mon, 13 Nov 2006 15:22:09 -0800

(Sorry for the long post, but I felt this issue needs to be made very 
clear ...)

Nutch Newbie wrote:
> Here is some general comments:
>
> The problem is in Hadoop i.e. map-reduce, i.e. processing. Hadoop-206
> is not solved..Have a look.
>
> http://www.mail-archive.com/hadoop-user%40lucene.apache.org/msg00521.html
>
> Well, again its a wishful thinking to ask for many developers, patch
> and bug reporting and bug fixes - without focusing on the need of such
> developers.  Same example again!  hadoop-206 was reported and it is
> still not solved. So how do you expect to get more developers? when

Before we get carried away, let me state clearly that reporting a 
problem and providing a fix for a problem are two different things - 
Hadoop-206 is a problem report, but without a fix. If there were a fix 
for it, it would most probably have been applied a long time ago. The reason it's 
not solved is that it's not a high priority issue for active developers, 
and there is no easy fix to be applied.

If this issue is a high priority for you, then fix it and provide a 
patch so that others may benefit from it - that's how Open Source 
projects work. Pointing fingers and saying "you should have done this or 
that a long time ago" won't fix the stuff by itself. Are you a developer? 
Then fix it. If not, then you should now understand why we kindly _ask_ 
for more developers to get involved. Reporting problems is very useful 
and crucial, but so is having the skilled manpower to fix them.

>
> See when the focus of the development is to solve 1000 machine/ large
> install,  then the issues like 206 is never solved. Thus asking for
> more developer to provide bug fixes is a wishful thinking.

No, we ask because we really need developers who could help us, who take 
initiative to fix something if it's broken in their particular use case.

The focus is on large clusters because that's what the majority of active 
developers use. If there were more active developers with a focus on small 
clusters (or single machine deployments) - hint, hint - the focus would 
move in this direction. There is no conspiracy here, nor do we willfully 
ignore the needs of people with small deployments - it's just a matter 
of what the priority is among active developers.

Complaining about this won't help as much as providing actual patches to 
solve issues. Until then, a faster single-machine deployment is a "nice 
to have" thing, but not the top priority.

>
> Sorry if I knew how to solve map/reduce problem i would fix it and
> submit patch and I am sure I am not the only one here. Map/reduce
> stuff is not really walk in the park :-).
>
> The current direction of nutch development is geared towards large
> install and its a great software.  However lets not pretend/preach
> Nutch is good for small install, Nutch left that life when it embraced
> Map/Reduce i.e. starting from 0.8.

You need to take into account that this is the first official release of 
Nutch after a major brain surgery, so it's no wonder things are a little 
bit twitchy ;) There are in fact very few, if any, places in Nutch that 
still use the same data models and algorithms as they did in the 0.7 era.

Having said that, I just did a crawl of 1 million pages within ~30 hours, on 
a single machine, which should give me a 100 million collection within 2 
months. This speed is acceptable for me, even if it's slower than 0.7, 
and if one day I want to go beyond 100 million pages I know that I will be 
able to do it - which _cannot_ be said about 0.7 ... So, you can look at 
it as a tradeoff.

(BTW: the issue with slow reduce phase is well known, and people from 
the Hadoop project are working on it even as we speak).

Oh, and regarding the subject of this thread - the strategic direction 
of Nutch is to provide a viable platform for medium to large scale 
search engines, be they Internet-wide or Intranet / constrained to a 
specific area. This was the original goal of the project, and it still 
reflects our ambitions. HOWEVER, if a significant part of the active 
community is focused on small / embedded deployments, then you need to 
make your voice heard _and_ start contributing to the project so that it 
also becomes a viable solution for your needs.

I hope this long answer helps you to understand why things are the way 
they are ... ;)

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
