Dennis Kubes wrote:
> I completely agree with this. I am interested in devoting as much
> time as possible to seeing the success of Nutch, Hadoop, and Lucene.
> As our business grows I would also be willing to devote developers
> full time to work on Nutch, Hadoop, and Lucene.
>
> I think that at least one company needs to come out and have a
> production search engine that is competition, however small, to the
> googles and yahoos of the world, built on Nutch and Hadoop. I thought
> that was the original goal of Nutch. I know there are some out there
> right now like Mozdex, but I mean a true billion page system. I think
> the .8 codebase, and yes improvements could be made, is capable of
> supporting such a system. I think then you will see many more
> developers become interested in the project. If you build it they
> will come.

Sure, I'd love to point people to such a system. But have you calculated how much money, in initial investment and then in ongoing costs, is needed to maintain such an index? It cannot happen just because of someone's goodwill; there must be a sound business idea behind it, and a team of dedicated people to make it happen and persevere - not just to demonstrate how good Nutch is, but to keep it up for the sake of their own business.

>
> I will say that it is difficult for people to understand how to get
> more involved. I have been working with Nutch and Hadoop for almost a
> year now on a daily basis and only now am I understanding how to
> contribute through jira, etc. There needs to be more guidance in
> helping developers contribute. For example if you want to develop a
> new piece of function they do x, y, and z. Here is how to patch your
> system. If you want to develop a patch then here are the steps. I
> have programmed in Java for many years but haven't worked on many open
> source projects before. The process of how they work isn't explicit
> and it needs to be.

Hmm. I might not be objective here anymore. There is, however, some documentation already on the Wiki that explains how to contribute - if you feel it's inadequate, please use your hard-earned experience to improve it.

>
> We worked up many patches for issues we came up against in the .8 and
> .4 codebases but they were never contributed because, as stupid as it
> might sound, we really don't know how to give it back. The best thing
> I thought I could do was to help answer questions on the list. Again
> just need a little guidance.
>
>>> Are you willing to spend the time and do the required refactoring?
>>> Anyone else?
>
> Yes, I am and I currently have 2 other developers that can help.

Sounds great.
We could start by creating a new page on the Wiki, which would collect our vision for Nutch - as I mentioned to Stefan, I think we should take a step back and think about the strategy for the next 1-2 years of Nutch development, and about who the target audience is.

>> Sure if we start a 2.x branch and if I'm not developing for the trash
>> or "jira nirvana", I can imaging to contribute. I

Just a quick comment: "jira nirvana" (which I believe refers to patches sitting idle in Jira for a long time) is not caused by ill will or disrespect for contributors, but foremost by limited human resources. If we want to maintain a certain level of quality, these patches cannot be applied blindly, but need to be reviewed, analyzed, applied, tested, and committed. That's an awful lot of work for 2-3 people, who also have other things to do ...

>> It is very less attractive to developers spending weeks to find a bug
>> like the regular expression one. Than such a bug sits there for month
>> in the jira being rejected. Sure if nobody of the contributors run
>> nutch with a 500 mio url

It's not being rejected - see the comments on that issue; there is overall agreement that it's ok, and it simply hasn't been applied yet. See above for the why.

>>> I'm slowly coming to a point where I should be able to fix it - but
>>> let's not throw out the baby with the water ...
>> Wow, I hold my finger crossed!
>
> There is a great book on this. It is 0691122024. Andrzej send me
> your address and I will buy and ship you a copy if you don't have it.

Too late :) I found it two weeks ago, and it's already on its merry way - but thanks for the offer.

> We would also be willing to help develop this functionality further.

I started working on a testbed as part of another commercial project; it's likely that I could get a release from the customer to contribute this code to the project. A testbed is a prerequisite for any serious work on ranking and the web graph.
(It's quite unfortunate that the best-of-breed open source framework for working with web graphs is licensed under LGPL ...)

>
> I can definitely see a desire to re-write but I think even if you
> re-write you are still going to have the same problem. Search is hard
> and without guidance we can't get enough developers to understand what
> they need to know to help.

Indeed. People often don't appreciate how many heuristics and trials, beyond pure academic-level IR, are needed to come up with a system that gives a decent quality of results and is manageable. Nutch may not be perfect, but there's a lot of this specific knowledge already accumulated here.

> At this time I don't think it is a design problem I think it is a
> people problem. I will be more than willing to head up training,
> documenting, and helping developers get up to speed. I just need
> direction in this area myself.

I believe that at this point it's crucial to keep the project well-focused (at the moment I think the main focus is on larger installations, and not the small ones), and also to make Nutch attractive to developers as a reusable "search engine" component.

Let's continue the discussion. I'll create the page on the Wiki; please feel free to add your thoughts.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers