All,

Sorry that I didn't reply directly, so this isn't threaded properly. I've lurked on the list via the RSS feed, and I subscribed so I could put in my two cents' worth.

I've recently started using git to maintain a local branch of Nutch. My hope is to get my employer to let me contribute "just engineering" back to Nutch. We'd like to customize Nutch in various ways and use that as the basis of internal R&D and potentially some products that we would not contribute. The other changes, the ones that just make Nutch more flexible, I'd like to contribute.
I've been working with Nutch on and off for my job since sometime around November. A couple of thoughts:

1. Nutch is too monolithic.
2. Nutch does the heavy lifting of a framework for a distributed system well.
3. Nutch doesn't keep all of its various pieces up to date very well.
4. Nutch requires at least a Bachelor's in Nutch to deal with it.
5. The documentation in the wiki is out of date, or it is hard to tell which versions various things work with.
6. Nutch isn't very friendly to simple requests if a complex hack can be found instead. (See recursive file:// handling.)

My most recent task was to update Tika to 0.3 and then use Tika's parsing of the docx format for indexing. There were several interesting problems, but I want to get permission from my employer and then just show the patches.

I think we fall into the category of #2 (we wish we could fall into category #1, but such is life). We want to make our intranet searchable on a large scale, and would like to apply the indexing and retrieval in a number of R&D projects. We also have an interest in using Nutch/Lucene/Hadoop for a number of other problems unrelated to Internet search.

A couple of things that I'd like to help do (or see done) that would make Nutch far more framework-like, so I can assemble the pieces and parts into what I need:

1. Get Nutch and its various components into a public Maven repository, and have public scripts to do the publishing. I don't care whether that is via Ant with Ivy extensions or by switching to a Maven build system. I've actually started with both approaches. I'm much better with Maven, but I think Ivy is more likely to be acceptable to the project. I'd like to see this done with Hadoop and any other core components as well. For now, I'm just maintaining a local POM file that pushes my builds into our local Maven repository. I'm going to do this one way or another, and would love to hear any feedback on an approach that would be acceptable to contribute back to Nutch.

2.
Clearly segregate "Plugins" from "Core" from "Bits that make it an Application". I've had fun problems with ClassLoaders, and it seems that the interface plugins are allowed to access is "anything in Core, or its existing libraries". It would seem better to have a relatively minimal Core Runtime that plugins can depend upon. Identify the pieces of Nutch that are there to make it into a program you can run, and push those into a separate place. For APIs with multiple implementations, it would be nice not to be forced to use the same one the Core does when writing a plugin.

3. As you stated earlier, use OSGi for a plugin system and some type of dependency injection rather than hand-parsed XML files. I've had problems with the PluginClassloader: I wanted to use Tika in my plugin, and because of the plugin/classloader setup, I had to push the POI libraries into the lib directory rather than the src/plugin/plugin-XXX/lib directory. Well, that was the first approach. The second was to hack the PluginClassloader to not delegate to the parent for the "org.apache.tika" package and then provide Tika in the plugin, and it all worked. Using a well-known plug-in system would have made this much easier.

4. Help transition to using third-party libraries. Nutch still has an SWF parser that went unmaintained in 2002. Flash has moved a long way; it would seem sensible to either jettison that code or update to a newer version of the same library by the same project (SWF2). Not that I care about Flash, but it seems that parsing isn't something Nutch proper is focused on.

5. With whatever build system is chosen, figure out how to set up a Maven build to construct "out-of-tree" Nutch plugins without having to manually deal with all of the various dependencies and packaging details.

6. Better support for running out of an IDE. The instructions work, and are very helpful.
It would be much nicer to see tools or scripts used to generate a saner setup than is currently there (having each plugin be a project in Eclipse would be a huge help in debugging weird classpath issues). Right now, compiling and running inside Eclipse isn't at all similar to running outside it if you have any type of classloader issue or multiple conflicting libraries. Not that there are any in-tree right now, but I can see how future ones could exist.

7. Make each plugin its own deliverable (even if they are all maintained inside one tree), including the ability to assemble a ".job" from the various internal and external components via Ant tasks or Maven plugins. It's very tedious to maintain anything outside of the tree right now.

8. Try to reduce the number of "all or nothing" decisions. See NUTCH-407: in order to deal with a straightforward problem, the suggestion is to switch from blacklisting out to whitelisting in. That's a really significant change to deal with a relatively simple problem, especially given that I listed a specific directory I wanted to have indexed. If I wanted a higher level to be indexed, I'd have listed that higher level in my list of seeds. There are other similar issues, like the handling of ' ' in URLs. I used the URL Filter Normalizer to fix that, but I can't see how to apply it to http:// but not file://. The handling of the two is very different. I had similar problems with "mime.magic": I can either have it on or off. It would be great if there were much finer granularity there. The patch in NUTCH-407 was the first thing applied to my local branch, because having to remember to modify the urlfilter configuration seems like a real hassle when I just want to add a new directory.

9. Stop treating Windows as a second-class citizen for local-only configurations. While it works with Cygwin, it is problematic for a number of my deployments to have to install Cygwin.
It would be great if we could just write a small abstraction that did nothing on Windows platforms. In non-distributed mode, you can write a no-op program that does nothing; as long as it returns the proper exit code, it all works fine. It would be great if at least that much could be made to work on Windows without Cygwin. If I remember right, there are 3-4 programs (chown, chgrp, whoami, groups?), I think. They can do nothing and return garbage and it'll all just work. I did that as a cheater's way of getting out of the Cygwin dependency. I've run Linux on my desktop for 9 of the last 10 years, but it's a simple fact that to be adopted by many smaller deployments, Windows support is critical. I'm going to have to maintain that. I'm sure others will too, so it would be really nice if we could maintain just one copy of it in the tree.

Anyway, from my perspective, I'd like to help contribute solutions to various problems. Nutch is a great concept, and so is much of the implementation, but right now it's just really hard to use from the outside.

Kirby
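
P.S. To make the PluginClassloader workaround from item 3 a bit more concrete, here is a minimal sketch of the general "child-first for selected packages" idea. It is not the actual patch and not Nutch's class: ChildFirstClassLoader, isChildFirst, and the prefix list are names made up for illustration, and it assumes the plugin's jars are handed to the loader as URLs.

```java
import java.net.URL;
import java.net.URLClassLoader;

/**
 * Hypothetical loader (NOT Nutch's PluginClassloader): classes whose names
 * start with one of the given prefixes are looked up in this loader's own
 * URLs first; everything else uses normal parent-first delegation.
 */
public class ChildFirstClassLoader extends URLClassLoader {
    private final String[] childFirstPrefixes;

    public ChildFirstClassLoader(URL[] urls, ClassLoader parent,
                                 String... childFirstPrefixes) {
        super(urls, parent);
        this.childFirstPrefixes = childFirstPrefixes.clone();
    }

    /** True if the class name falls under one of the child-first prefixes. */
    public boolean isChildFirst(String className) {
        for (String prefix : childFirstPrefixes) {
            if (className.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        if (!isChildFirst(name)) {
            // Default behaviour: ask the parent first.
            return super.loadClass(name, resolve);
        }
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    // Look in the plugin's own jars before the parent.
                    c = findClass(name);
                } catch (ClassNotFoundException e) {
                    // Not shipped with the plugin: fall back to the parent.
                    c = super.loadClass(name, false);
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}
```

A loader constructed with the prefix "org.apache.tika." (note the trailing dot, so that sibling packages with a similar name are not caught) would pick up the Tika jars shipped in the plugin's own lib directory, while everything else still resolves through the parent as before.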