Just to put in my view.

Stefan Groschupf wrote:
> Hi Andrzej,
>
> thank you for taking the time to comment, I highly value your comments.
>
>> * I guess that for each case where Nutch seems inappropriate I could
>> give you a counter-example of Nutch being used commercially with much
>> success. I guess it depends on a particular application and the type
>> of customer.
>
> Yes, it would be interesting to hear who use nutch .8 _successfully_ in
> production.

Although I can't say who we are yet, since we are in the middle of private equity funding, we have built a production categorization search engine on the Nutch .8 and Hadoop .4 code bases, and we are currently scaling it to 100M pages.

>
>> * no doubt Nutch has its warts - the plugin system could be simpler,
>> for example ;) but hey, it's great that we have a plugin system at
>> all! It would be easier now to refactor Nutch to use a different
>> plugin system than it was to go from the completely monolithic design
>> to the plugin system ... As with any open source project - if you
>> don't like it, fix it and contribute the fix.
>
> Sure - I tried that more than once - but I do not want to start this
> discussion again.
>
>> * things won't happen magically unless there is a greater involvement
>> of skilled developers. "One way road" - well, with limited resources
>> that this project has at the moment the only way is to gradually
>> improve, we cannot afford to abandon the current codebase and start
>> from scratch.
>
> I agree - the problem are skilled developers, I remember more than one
> offer of different companies to dedicate developers to the project, but
> looks like there was no interest.

I completely agree with this. I am interested in devoting as much time as possible to seeing the success of Nutch, Hadoop, and Lucene. As our business grows I would also be willing to devote developers full time to work on Nutch, Hadoop, and Lucene.

I think that at least one company needs to come out and have a production search engine, built on Nutch and Hadoop, that is competition, however small, to the Googles and Yahoos of the world. I thought that was the original goal of Nutch. I know there are some out there right now like Mozdex, but I mean a true billion page system. I think the .8 codebase, though improvements could certainly be made, is capable of supporting such a system.
I think then you will see many more developers become interested in the project. If you build it they will come.

I will say that it is difficult for people to understand how to get more involved. I have been working with Nutch and Hadoop on a daily basis for almost a year now, and only now am I understanding how to contribute through jira, etc. There needs to be more guidance to help developers contribute. For example: if you want to develop a new piece of functionality, do x, y, and z. Here is how to patch your system. If you want to develop a patch, here are the steps. I have programmed in Java for many years but haven't worked on many open source projects before. The process of how they work isn't explicit and it needs to be. We worked up many patches for issues we came up against in the .8 and .4 codebases, but they were never contributed because, as stupid as it might sound, we really don't know how to give them back. The best thing I thought I could do was to help answer questions on the list. Again, we just need a little guidance.

>> Are you willing to spend the time and do the required refactoring?
>> Anyone else?

Yes, I am, and I currently have 2 other developers who can help.

>
> In general there was some emotional discussion about API changes. Since
> nutch is a 0.x and also a software and not a library more frequent
> refactorings had may be improved the maintainability of the code over
> the time.
>
> Sure if we start a 2.x branch and if I'm not developing for the trash or
> "jira nirvana", I can imaging to contribute. I would rethink and rewrite
> some major parts (e.g. remove the reusage of objects with a complex
> states and endless if than else conditions no body can debug) and I
> believe that is not difficult. I'm not talking about the algorithm stuff
> here.
>
>>> May be one day we can get some developer together first think about a
>>> good extendable design and than start a 2.x stream or a new project.
>>
>> I hope so too. But as Steve B.
>> said once, what we need is "developers,
>> developers, developers ..." ;)
>
> I agree, however it must be attractive for developers to spend time in a
> open source project. We saw many developers here. You are the only one
> left that does some serious development and I can't find words how much
> respect I have for your work. You are the only one that is able to fix
> serious bugs.

We also have much respect for you, Andrzej. You may have more developers than you think. They might just not know how to contribute.

> It is very less attractive to developers spending weeks to find a bug
> like the regular expression one. Than such a bug sits there for month in
> the jira being rejected. Sure if nobody of the contributors run nutch
> with a 500 mio url web db, than it might be difficult to reproduce such
> a bug. If you have a set of a such issues (another one is the gui etc.)
> you decide to run your very own nutch brunch in your home svn. At least
> all of my customers did over the time. The result - no public nutch
> contributions, no developers.
>
>> Nutch, as it is now, is not too well-focused, so that may be the
>> reason why it doesn't attract too many developers - and casual users
>> find it perhaps too difficult to get interested enough to dig deeper.

Definitely agree. Better documentation is needed to attract the more "casual" developers. We would be willing to help produce this.

>
> I agree that is another issue, since nutch tries to solve to many
> problems at the same time the code is to difficult to understand for
> newbies.
>
>
>> On one end of spectrum we have small desktop installations in mind, on
>> the other end we have scalable 1 bln page server farms ... it's hard
>> to satisfy everyone, and the current design is not that satisfactory
>> for either group.
>> So, I think a better focus is needed, combined with
>> design that satisfies either one or the other group - or maybe two
>> designs for each group, assuming we can motivate enough people to
>> participate in each sub-project.
>
> Sounds like a good idea! :-)

Agreed.

>
>>> And ... yes no opic and yes definitely no plugin architecture (I feel
>>> very sorry for all that wast so much life time
>>
>> Ah, the more I study the theory behind PageRank calculation the more I
>> think OPIC is an excellent solution to this hard problem - but our
>> current implementation is broken.
>
> Yes - very much, a search engine that need to recrawl from scratch each
> time to get sense-fully index scores - that is really broken.
> However at least the "page rank" implementation in nutch .7 worked great
> for me, it just didn't scaled that well.
>
>
>> I'm slowly coming to a point where I should be able to fix it - but
>> let's not throw out the baby with the water ...
> Wow, I hold my finger crossed!

There is a great book on this. The ISBN is 0691122024. Andrzej, send me your address and I will buy and ship you a copy if you don't have it. We would also be willing to help develop this functionality further.

>
>>> because of my terrible complicate plugin system) but a clean IOC
>>> design with lightweight default interface implementations and a great
>>> test coverage.
>>> Anyway just my *very little* point of view based on 3.5 years nutch
>>> experience.
>> I'm looking forward to your patches that implement the clean IOC
>> design ;) Seriously - if you can show how to refactor a portion of
>> Nutch to a clean IOC design, we will start refactoring the rest of it
>> in this direction.
>

We would be happy to help here as well.

> No - sorry I'm personal too tiered doing patches. I already talked to a
> set of people and at least 2 good developers would be serious interested
> in writing a search engine from scratch based on hadoop by "reusing" as
> much nutch code as sense-fully.
> Another 2 would be interested. All those
> people worked with nutch or working in the IR research area or in
> vertical search companies. May be a interesting starting point for a
> nice summer project.
> We might even find some company that would sponsor some work - I know at
> least 2 that would be interested. You might know one as well. :-)
>
> Anyway don't count on that - I don't know - at least in the moment I can
> imaging more interesting things in my spear time than doing nutch patches.
>
>
> I don't want to start a emotional discussion here, however talking about
> the problem in public might help.
> Cheers,
> Stefan
>

I can definitely see a desire to re-write, but I think even if you re-write you are still going to have the same problem. Search is hard, and without guidance we can't get enough developers to understand what they need to know to help. At this time I don't think it is a design problem; I think it is a people problem. I will be more than willing to head up training, documenting, and helping developers get up to speed. I just need direction in this area myself.

Dennis Kubes

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers