Hi Andrzej,
thank you for taking the time to comment; I highly value your input.
* I guess that for each case where Nutch seems inappropriate I
could give you a counter-example of Nutch being used commercially
with much success. I guess it depends on a particular application
and the type of customer.
Yes, it would be interesting to hear who uses Nutch 0.8 _successfully_
in production.
* no doubt Nutch has its warts - the plugin system could be
simpler, for example ;) but hey, it's great that we have a plugin
system at all! It would be easier now to refactor Nutch to use a
different plugin system than it was to go from the completely
monolithic design to the plugin system ... As with any open source
project - if you don't like it, fix it and contribute the fix.
Sure - I tried that more than once - but I do not want to start this
discussion again.
* things won't happen magically unless there is a greater
involvement of skilled developers. "One way road" - well, with
limited resources that this project has at the moment the only way
is to gradually improve, we cannot afford to abandon the current
codebase and start from scratch.
I agree - the problem is skilled developers. I remember more than
one offer from different companies to dedicate developers to the
project, but it looks like there was no interest.
Are you willing to spend the time and do the required refactoring?
Anyone else?
In general there was some emotional discussion about API changes.
Since Nutch is at 0.x and is an application rather than a library,
more frequent refactorings might have improved the maintainability of
the code over time.
Sure, if we start a 2.x branch and if I'm not developing for the trash
or "Jira nirvana", I can imagine contributing. I would rethink and
rewrite some major parts (e.g. remove the reuse of objects with
complex state and the endless if-then-else conditions nobody can
debug), and I believe that is not difficult. I'm not talking about the
algorithm stuff here.
Maybe one day we can get some developers together, first think
about a good extensible design and then start a 2.x stream or a
new project.
I hope so too. But as Steve B. said once, what we need is
"developers, developers, developers ..." ;)
I agree, however it must be attractive for developers to spend time
on an open source project. We saw many developers here. You are the
only one left who does some serious development, and I can't find
words for how much respect I have for your work. You are the only one
who is able to fix serious bugs.
It is not very attractive for developers to spend weeks finding a bug
like the regular expression one. Then such a bug sits in Jira for
months, being rejected. Sure, if none of the contributors runs
Nutch with a 500 million URL web db, then it might be difficult to
reproduce such a bug. If you have a set of such issues (another one
is the GUI etc.) you decide to run your very own Nutch branch in
your home svn. At least all of my customers did over time. The
result - no public Nutch contributions, no developers.
Nutch, as it is now, is not too well-focused, so that may be the
reason why it doesn't attract too many developers - and casual
users find it perhaps too difficult to get interested enough to dig
deeper.
I agree, that is another issue: since Nutch tries to solve too many
problems at the same time, the code is too difficult for newbies to
understand.
On one end of spectrum we have small desktop installations in mind,
on the other end we have scalable 1 bln page server farms ... it's
hard to satisfy everyone, and the current design is not that
satisfactory for either group. So, I think a better focus is
needed, combined with design that satisfies either one or the other
group - or maybe two designs for each group, assuming we can
motivate enough people to participate in each sub-project.
Sounds like a good idea! :-)
And ... yes, no OPIC and yes, definitely no plugin architecture (I
feel very sorry for all those who wasted so much of their lifetime
Ah, the more I study the theory behind PageRank calculation the
more I think OPIC is an excellent solution to this hard problem -
but our current implementation is broken.
Yes - very much. A search engine that needs to recrawl from scratch
each time to get sensible index scores - that is really broken.
However, at least the "page rank" implementation in Nutch 0.7 worked
great for me; it just didn't scale that well.
I'm slowly coming to a point where I should be able to fix it - but
let's not throw out the baby with the bathwater ...
Wow, I'll keep my fingers crossed!
because of my terribly complicated plugin system) but a clean IOC
design with lightweight default interface implementations and
great test coverage.
Anyway, just my *very little* point of view based on 3.5 years of
Nutch experience.
I'm looking forward to your patches that implement the clean IOC
design ;) Seriously - if you can show how to refactor a portion of
Nutch to a clean IOC design, we will start refactoring the rest of
it in this direction.
No - sorry, I'm personally too tired of doing patches. I already talked
to a number of people, and at least 2 good developers would be seriously
interested in writing a search engine from scratch based on Hadoop,
"reusing" as much Nutch code as is sensible. Another 2 would be
interested. All those people have worked with Nutch or work in the IR
research area or in vertical search companies. Maybe an interesting
starting point for a nice summer project.
We might even find some company that would sponsor some work - I know
at least 2 that would be interested. You might know one as well. :-)
Anyway, don't count on that - I don't know - at least at the moment I
can imagine more interesting things to do in my spare time than doing
Nutch patches.
I don't want to start an emotional discussion here, however talking
about the problem in public might help.
Cheers,
Stefan