Just to put in my view.
Stefan Groschupf wrote:
Hi Andrzej,
thank you for taking the time to comment, I highly value your comments.
* I guess that for each case where Nutch seems inappropriate I could
give you a counter-example of Nutch being used commercially with much
success. I guess it depends on a particular application and the type
of customer.
Yes, it would be interesting to hear who uses Nutch .8 _successfully_ in
production.
Although I can't say who we are yet as we are in the middle of private
equity funding, we have built a production version categorization
search engine that uses the Nutch .8 and hadoop .4 code base that we are
currently in the process of scaling to 100M pages.
* no doubt Nutch has its warts - the plugin system could be simpler,
for example ;) but hey, it's great that we have a plugin system at
all! It would be easier now to refactor Nutch to use a different
plugin system than it was to go from the completely monolithic design
to the plugin system ... As with any open source project - if you
don't like it, fix it and contribute the fix.
Sure - I tried that more than once - but I do not want to start this
discussion again.
* things won't happen magically unless there is a greater involvement
of skilled developers. "One way road" - well, with limited resources
that this project has at the moment the only way is to gradually
improve, we cannot afford to abandon the current codebase and start
from scratch.
I agree - the problem is skilled developers. I remember more than one
offer from different companies to dedicate developers to the project, but
it looks like there was no interest.
I completely agree with this. I am interested in devoting as much time
as possible to seeing the success of Nutch, Hadoop, and Lucene. As our
business grows I would also be willing to devote developers full time to
work on Nutch, Hadoop, and Lucene.
I think that at least one company needs to come out and have a
production search engine that is competition, however small, to the
googles and yahoos of the world, built on Nutch and Hadoop. I thought
that was the original goal of Nutch. I know there are some out there
right now like Mozdex, but I mean a true billion page system. I think
the .8 codebase, and yes improvements could be made, is capable of
supporting such a system. I think then you will see many more
developers become interested in the project. If you build it they will
come.
I will say that it is difficult for people to understand how to get more
involved. I have been working with Nutch and Hadoop for almost a year
now on a daily basis and only now am I understanding how to contribute
through jira, etc. There needs to be more guidance in helping
developers contribute. For example, if you want to develop a new piece
of functionality, do x, y, and z. Here is how to patch your system. If
you want to develop a patch then here are the steps. I have programmed
in Java for many years but haven't worked on many open source projects
before. The process of how they work isn't explicit and it needs to be.
We worked up many patches for issues we came up against in the .8 and .4
codebases but they were never contributed because, as stupid as it might
sound, we really don't know how to give it back. The best thing I
thought I could do was to help answer questions on the list. Again, we
just need a little guidance.
Are you willing to spend the time and do the required refactoring?
Anyone else?
Yes, I am and I currently have 2 other developers that can help.
In general there was some emotional discussion about API changes. Since
Nutch is at 0.x and is an application rather than a library, more frequent
refactoring might have improved the maintainability of the code over
time.
Sure, if we start a 2.x branch and if I'm not developing for the trash or
"jira nirvana", I can imagine contributing. I would rethink and rewrite
some major parts (e.g. remove the reuse of objects with complex
state and the endless if-then-else conditions nobody can debug), and I
believe that is not difficult. I'm not talking about the algorithm stuff
here.
Maybe one day we can get some developers together to first think about a
good extensible design and then start a 2.x stream or a new project.
I hope so too. But as Steve B. said once, what we need is "developers,
developers, developers ..." ;)
I agree; however, it must be attractive for developers to spend time on an
open source project. We saw many developers here. You are the only one
left who does some serious development, and I can't find words for how much
respect I have for your work. You are the only one that is able to fix
serious bugs.
We also have much respect for you Andrzej.
You may have more developers than you think. They might just not know
how to contribute.
It is much less attractive for developers to spend weeks finding a bug
like the regular expression one. Then such a bug sits there for months in
the jira, being rejected. Sure, if none of the contributors runs Nutch
with a 500 million URL web db, then it might be difficult to reproduce such
a bug. If you have a set of such issues (another one is the GUI, etc.),
you decide to run your very own Nutch branch in your home svn. At least
all of my customers did over time. The result - no public Nutch
contributions, no developers.
Nutch, as it is now, is not too well-focused, so that may be the
reason why it doesn't attract too many developers - and casual users
find it perhaps too difficult to get interested enough to dig deeper.
Definitely agree. Better documentation is needed to attract the more
"casual" developers. We would be willing to help produce this.
I agree, that is another issue: since Nutch tries to solve too many
problems at the same time, the code is too difficult to understand for
newbies.
On one end of spectrum we have small desktop installations in mind, on
the other end we have scalable 1 bln page server farms ... it's hard
to satisfy everyone, and the current design is not that satisfactory
for either group. So, I think a better focus is needed, combined with
design that satisfies either one or the other group - or maybe two
designs for each group, assuming we can motivate enough people to
participate in each sub-project.
Sounds like a good idea! :-)
Agreed.
And ... yes, no OPIC and yes, definitely no plugin architecture (I feel
very sorry for all who wasted so much of their lifetime
Ah, the more I study the theory behind PageRank calculation the more I
think OPIC is an excellent solution to this hard problem - but our
current implementation is broken.
Yes - very much. A search engine that needs to recrawl from scratch each
time to get sensible index scores - that is really broken.
However, at least the "page rank" implementation in Nutch .7 worked great
for me; it just didn't scale that well.
I'm slowly coming to a point where I should be able to fix it - but
let's not throw out the baby with the water ...
Wow, I'm keeping my fingers crossed!
There is a great book on this. Its ISBN is 0691122024. Andrzej, send me your
address and I will buy and ship you a copy if you don't have it. We
would also be willing to help develop this functionality further.
because of my terribly complicated plugin system) but a clean IOC
design with lightweight default interface implementations and a great
test coverage.
Anyway, just my *very little* point of view based on 3.5 years of Nutch
experience.
I'm looking forward to your patches that implement the clean IOC
design ;) Seriously - if you can show how to refactor a portion of
Nutch to a clean IOC design, we will start refactoring the rest of it
in this direction.
Would be happy to help here as well.
No - sorry, I'm personally too tired of doing patches. I already talked to a
set of people, and at least 2 good developers would be seriously interested
in writing a search engine from scratch based on Hadoop by "reusing" as
much Nutch code as is sensible. Another 2 would be interested. All those
people have worked with Nutch or are working in the IR research area or in
vertical search companies. Maybe an interesting starting point for a
nice summer project.
We might even find some company that would sponsor some work - I know at
least 2 that would be interested. You might know one as well. :-)
Anyway, don't count on that - I don't know - at least at the moment I can
imagine more interesting things to do in my spare time than Nutch patches.
I don't want to start an emotional discussion here; however, talking about
the problem in public might help.
Cheers,
Stefan
I can definitely see a desire to re-write but I think even if you
re-write you are still going to have the same problem. Search is hard
and without guidance we can't get enough developers to understand what
they need to know to help. At this time I don't think it is a design
problem; I think it is a people problem. I will be more than willing to
head up training, documenting, and helping developers get up to speed.
I just need direction in this area myself.
Dennis Kubes