Dennis Kubes wrote:
> I completely agree with this.  I am interested in devoting as much 
> time as possible to seeing the success of Nutch, Hadoop, and Lucene.  
> As our business grows I would also be willing to devote developers 
> full time to work on Nutch, Hadoop, and Lucene.
>
> I think that at least one company needs to come out and have a 
> production search engine that is competition, however small, to the 
> googles and yahoos of the world, built on Nutch and Hadoop.  I thought 
> that was the original goal of Nutch.  I know there are some out there 
> right now like Mozdex, but I mean a true billion page system.  I think 
> the .8 codebase, and yes improvements could be made, is capable of 
> supporting such a system.  I think then you will see many more 
> developers become interested in the project.  If you build it they 
> will come.

Sure, I'd love to point people to such a system. But have you calculated 
how much money is needed, both as an initial investment and in ongoing 
costs, to maintain such an index? It cannot happen just because of 
someone's goodwill; there must be a sound business idea behind it, and a 
team of dedicated people to make it happen and persevere - not just to 
demonstrate how good Nutch is, but to keep it up for the sake of their 
own business.

>
> I will say that it is difficult for people to understand how to get 
> more involved.  I have been working with Nutch and Hadoop for almost a 
> year now on a daily basis and only now am I understanding how to 
> contribute through jira, etc.  There needs to be more guidance in 
> helping developers contribute.  For example, if you want to develop a 
> new piece of functionality, do x, y, and z.  Here is how to patch your 
> system. If you want to develop a patch then here are the steps.  I 
> have programmed in Java for many years but haven't worked on many open 
> source projects before.  The process of how they work isn't explicit 
> and it needs to be.

Hmm. I might not be objective here anymore. There is however some 
documentation already on the Wiki, which explains how to contribute - if 
you feel it's inadequate please use your hard-earned experience to 
improve it.

>
> We worked up many patches for issues we came up against in the .8 and 
> .4 codebases but they were never contributed because, as stupid as it 
> might sound, we really don't know how to give it back.  The best thing 
> I thought I could do was to help answer questions on the list.  Again 
> just need a little guidance.
>
>>> Are you willing to spend the time and do the required refactoring? 
>>> Anyone else?
>
> Yes, I am and I currently have 2 other developers that can help.

Sounds great. We could start by creating a new page on the Wiki, which 
would collect our vision for Nutch - as I mentioned to Stefan, I think we 
should take a step back, and think about the strategy for the next 1-2 
years of Nutch development, and who the target audience is.

>> Sure if we start a 2.x branch and if I'm not developing for the trash 
>> or "jira nirvana", I can imagine contributing. I 

Just a quick comment: "jira nirvana" (which I believe refers to patches 
sitting idle in Jira for a long time) is not caused by ill will or 
disrespect for contributors, but foremost by limited human resources. If 
we want to maintain a certain level of quality, these patches cannot be 
applied blindly, but need to be reviewed, analyzed, applied, tested, and 
committed. That's an awful lot of work for 2-3 people, who also have 
other things to do ...



>> It is much less attractive for developers to spend weeks finding a bug 
>> like the regular expression one. Then such a bug sits there for months 
>> in the jira being rejected. Sure if none of the contributors run 
>> nutch with a 500 mio url 

It's not being rejected - see the comments on that issue, there is an 
overall agreement that it's ok; it simply hasn't been applied yet. See 
above for the why.


>>> I'm slowly coming to a point where I should be able to fix it - but 
>>> let's not throw out the baby with the water ...
>> Wow, I'm keeping my fingers crossed!
>
> There is a great book on this.  It is 0691122024.  Andrzej send me 
> your address and I will buy and ship you a copy if you don't have it.  

Too late :) I found it two weeks ago, and it's already on its merry way 
- but thanks for the offer.

> We would also be willing to help develop this functionality further.

I started working on a testbed as part of another commercial project; 
it's likely that I could get a release from the customer to contribute 
this code to the project. A testbed is a prerequisite for any serious 
work on ranking and the web graph.

(It's quite unfortunate that the best-of-breed open source framework for 
working with web graphs is licensed under LGPL ...)

>
> I can definitely see a desire to re-write but I think even if you 
> re-write you are still going to have the same problem.  Search is hard 
> and without guidance we can't get enough developers to understand what 
> they need to know to help.

Indeed. People often don't appreciate how many heuristics and how much 
trial and error, beyond pure academic-level IR, are needed to come up 
with a system that gives a decent quality of results, and is manageable. 
Nutch may not be perfect, but there's a lot of this specific knowledge 
already accumulated here.


> At this time I don't think it is a design problem I think it is a 
> people problem.  I will be more than willing to head up training, 
> documenting, and helping developers get up to speed. I just need 
> direction in this area myself.

I believe that at this point it's crucial to keep the project 
well-focused (at the moment I think the main focus is on larger 
installations, and not the small ones), and also to make Nutch 
attractive to developers as a reusable "search engine" component.

Let's continue the discussion. I'll create the page on the Wiki; please 
feel free to add your thoughts.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


