Hi Andrzej,

Thank you for taking the time to comment; I highly value your feedback.

> * I guess that for each case where Nutch seems inappropriate I  
> could give you a counter-example of Nutch being used  commercially  
> with much success. I guess it depends on a particular application  
> and the type of customer.

Yes, it would be interesting to hear who uses Nutch 0.8 _successfully_
in production.

> * no doubt Nutch has its warts - the plugin system could be  
> simpler, for example ;) but hey, it's great that we have a plugin  
> system at all! It would be easier now to refactor Nutch to use a  
> different plugin system than it was to go from the completely  
> monolithic design to the plugin system ... As with any open source  
> project - if you don't like it, fix it and contribute the fix.

Sure - I tried that more than once - but I do not want to start this  
discussion again.

> * things won't happen magically unless there is a greater  
> involvement of skilled developers. "One way road" - well, with  
> limited resources that this project has at the moment the only way  
> is to gradually improve, we cannot afford to abandon the current  
> codebase and start from scratch.

I agree - the problem is skilled developers. I remember more than
one offer from different companies to dedicate developers to the
project, but it looks like there was no interest.

> Are you willing to spend the time and do the required refactoring?  
> Anyone else?

In general there was some emotional discussion about API changes.
Since Nutch is a 0.x release and an application rather than a library,
more frequent refactorings might have improved the maintainability of
the code over time.

Sure, if we start a 2.x branch and if I'm not developing for the trash
or "jira nirvana", I can imagine contributing. I would rethink and
rewrite some major parts (e.g. remove the reuse of objects with
complex state and the endless if-then-else conditions nobody can debug),
and I believe that is not difficult. I'm not talking about the
algorithm stuff here.

>> Maybe one day we can get some developers together to first think
>> about a good extensible design and then start a 2.x stream or a
>> new project.
>
> I hope so too. But as Steve B. said once, what we need is  
> "developers, developers, developers ..." ;)

I agree; however, it must be attractive for developers to spend time
on an open source project. We saw many developers here. You are the
only one left who does serious development, and I can't find words
for how much respect I have for your work. You are the only one
who is able to fix serious bugs.

It is not very attractive for developers to spend weeks finding a bug
like the regular expression one, only to have it sit in Jira for months
and be rejected. Sure, if none of the contributors runs Nutch with a
500 million URL web db, then it might be difficult to reproduce such a
bug. If you have a set of such issues (another one is the GUI etc.),
you decide to run your very own Nutch branch in your home svn. At
least all of my customers did over time. The result - no public Nutch
contributions, no developers.


> Nutch, as it is now, is not too well-focused, so that may be the  
> reason why it doesn't attract too many developers - and casual  
> users find it perhaps too difficult to get interested enough to dig  
> deeper.

I agree, that is another issue: since Nutch tries to solve too many
problems at the same time, the code is too difficult for newbies to
understand.


> On one end of spectrum we have small desktop installations in mind,  
> on the other end we have scalable 1 bln page server farms ... it's  
> hard to satisfy everyone, and the current design is not that  
> satisfactory for either group. So, I think a better focus is  
> needed, combined with design that satisfies either one or the other  
> group - or maybe two designs for each group, assuming we can  
> motivate enough people to participate in each sub-project.

Sounds like a good idea! :-)

>> And ... yes no opic and yes definitely no plugin architecture (I  
>> feel very sorry for all who wasted so much of their lifetime
>
> Ah, the more I study the theory behind PageRank calculation the  
> more I think OPIC is an excellent solution to this hard problem -  
> but our current implementation is broken.

Yes - very much. A search engine that needs to recrawl from scratch
each time to get sensible index scores - that is really broken.
However, at least the "page rank" implementation in Nutch 0.7 worked
great for me; it just didn't scale that well.


> I'm slowly coming to a point where I should be able to fix it - but  
> let's not throw out the baby with the water ...
Wow, I'm keeping my fingers crossed!

>> because of my terribly complicated plugin system) but a clean IOC
>> design with lightweight default interface implementations and a  
>> great test coverage.
>> Anyway just my *very little* point of view based on 3.5 years  
>> nutch experience.
> I'm looking forward to your patches that implement the clean IOC  
> design ;) Seriously - if you can show how to refactor a portion of  
> Nutch to a clean IOC design, we will start refactoring the rest of  
> it in this direction.

No - sorry, I'm personally too tired of doing patches. I already talked
to a number of people, and at least 2 good developers would be seriously
interested in writing a search engine from scratch based on Hadoop,
"reusing" as much Nutch code as is sensible. Another 2 would be
interested. All those people have worked with Nutch or are working in
the IR research area or in vertical search companies. Maybe an
interesting starting point for a nice summer project.
We might even find some company that would sponsor some work - I know
at least 2 that would be interested. You might know one as well. :-)

Anyway, don't count on that - I don't know - at least at the moment I
can imagine more interesting things to do in my spare time than writing
Nutch patches.


I don't want to start an emotional discussion here; however, talking
about the problem in public might help.
Cheers,
Stefan


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
