Stefan Groschupf wrote:
Hi Scott,

feel free - I have no options on that.

From my very little point of view the nutch > .8 source stream is a one-way street. In all my projects we move as far away from nutch as possible. I like hadoop a lot, and writing customer tools on top of it is - that easy. But nutch .8 was a proof of concept for the early hadoop. There is only one serious developer left, and wow, what a great job he does - but nutch > .8 is just too monolithic, too difficult to extend, too difficult to debug, and too difficult to integrate for a serious mission-critical application. I spend a significant part of my life working with nutch daily, but if someone asked, I would answer: don't use it.

Let me comment on what you said:

* I guess that for each case where Nutch seems inappropriate I could give you a counter-example of Nutch being used commercially with much success. I guess it depends on a particular application and the type of customer.

* no doubt Nutch has its warts - the plugin system could be simpler, for example ;) but hey, it's great that we have a plugin system at all! It would be easier now to refactor Nutch to use a different plugin system than it was to go from the completely monolithic design to the plugin system ... As with any open source project - if you don't like it, fix it and contribute the fix.

* things won't happen magically unless there is greater involvement of skilled developers. "One way road" - well, with the limited resources that this project has at the moment, the only way is to improve gradually; we cannot afford to abandon the current codebase and start from scratch. Are you willing to spend the time and do the required refactoring? Anyone else?

Maybe one day we can get some developers together, first think about a good extensible design, and then start a 2.x stream or a new project.

I hope so too. But as Steve B. said once, what we need is "developers, developers, developers ..." ;)

Nutch, as it is now, is not too well-focused, so that may be the reason why it doesn't attract too many developers - and casual users perhaps find it too difficult to get interested enough to dig deeper. On one end of the spectrum we have small desktop installations in mind, on the other end we have scalable 1 bln page server farms ... it's hard to satisfy everyone, and the current design is not that satisfactory for either group. So, I think a better focus is needed, combined with a design that satisfies either one or the other group - or maybe two designs, one for each group, assuming we can motivate enough people to participate in each sub-project.

And ... yes, no OPIC and yes, definitely no plugin architecture (I feel very sorry for all who wasted so much of their lifetime

Ah, the more I study the theory behind PageRank calculation, the more I think OPIC is an excellent solution to this hard problem - but our current implementation is broken. I'm slowly coming to a point where I should be able to fix it - but let's not throw out the baby with the bathwater ...

because of my terribly complicated plugin system) but a clean IOC design with lightweight default interface implementations and great test coverage. Anyway, just my *very little* point of view based on 3.5 years of nutch experience.

I'm looking forward to your patches that implement the clean IOC design ;)
Seriously - if you can show how to refactor a portion of Nutch to a clean IOC design, we will start refactoring the rest of it in this direction.

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

