The Future of Nutch, reactivated

Kirby Bohling Thu, 14 May 2009 17:24:01 -0700

All,

Sorry that I didn't reply, and thus this isn't threaded properly.
I've lurked on the list via the RSS feed, and I subscribed so I could
put in my two cents' worth.  I've recently started using git to
maintain a local branch of Nutch.  My hope is to get my employer to
let me contribute "just engineering" back to Nutch.  We'd like to
customize Nutch in various ways and use that as the basis of internal
R&D and potentially some products that we'd not contribute.  The other
changes, the ones that just make Nutch more flexible, I'd like to
contribute back.


I've been working with Nutch on and off since sometime in November or
so for my job.  A couple of thoughts:

1. Nutch is too monolithic
2. Nutch does the heavy lifting of a framework for a distributed system well.
3. Nutch doesn't really keep all the various pieces up to date very well.
4. Nutch requires at least a Bachelor's in Nutch to deal with it.
5. Documentation in the Wiki is out of date, or it is hard to tell
which versions various things work with.
6. Nutch isn't very friendly to simple requests if a complex hack
could be found instead. (See recursive file:// handling).

My most recent task was actually to update Tika to 0.3 and then use
Tika's parsing of the docx format for indexing.  There were several
interesting problems, but I want to get permission from my employer
and just show the patches.

I think we fall into the category of #2 (we wish we could fall into
category #1, but such is life).  We want to make our intranet
searchable on a large scale, and would like to apply the indexing and
retrieval in a number of R&D projects.  We also have an interest in
using Nutch/Lucene/Hadoop in a number of other problems unrelated to
Internet Search.

A couple of things that I'd like to help do (or see done) that would
make Nutch far more framework-like so I can assemble the pieces and
parts into what I need:

1. Get Nutch and its various components into a public Maven
repository, and have public scripts to do the publishing.  I don't care
if that is via Ant with Ivy extensions, or switching to a Maven build
system.  I've actually started with both approaches.  I'm much better
with Maven, but I think Ivy is more likely to be acceptable to the
project.  I'd like to see this done with Hadoop, and any other core
components.  For now, I'm just maintaining a local POM file that
pushes my builds into our local Maven repository.  I'm going to do
this one way or another, and would love to hear any feedback on an
approach that is acceptable to be contributed back to Nutch.

2. Clearly segregate "Plugins" from "Core" from "Bits that make it an
Application".  I've had fun problems with ClassLoaders, and it seems
that the interface Plugins are allowed to access is "Anything in
Core, or its existing libraries".  It would seem better to have a
Core Runtime, which plugins can depend upon, that is relatively
minimal.  Identify the pieces of Nutch which are there to make it
into a program you can run, and push those into a separate place.
For APIs with multiple implementations, it would be nice not to be
forced to use the same one the Core does when a plugin is written.

3. As you stated earlier, use OSGi for a plugin system and some type
of dependency injection rather than hand-parsed XML files.  I've had
problems with the PluginClassloader (I wanted to use Tika in my
plugin, and because of the plugin/classloader setup, I had to push the
POI libraries into the lib directory rather than into the
src/plugin/plugin-XXX/lib directory).  Well, that was the first
approach; the second was to hack the PluginClassloader to not delegate
to the parent for the "org.apache.tika" package and then provide Tika
in the plugin, and it all worked.  Using a well-known plug-in system
would have made this much easier.

4. Help transition to using 3rd party libraries.  Nutch still has
an SWF parser that went unmaintained in 2002.  Flash has moved a long
way, it would seem sensible to either jettison that code, or update to
newer versions of the same library by the same project (SWF2).  Not
that I care about Flash, but it seems that parsing isn't something
Nutch proper is focused on.

5. With whatever build system is chosen, figure out how to set up a
Maven build to construct "Out-of-Tree" Nutch plugins without having to
manually deal with all of the various dependencies and packaging
details.

6. Better support for running out of an IDE.  The instructions work,
and are very helpful.  It'd be much nicer to see the use of tools or
scripts to generate a saner system than is currently there (having
each plugin be a project in Eclipse would be a huge help to debugging
weird classpath issues).  Right now, running and compiling inside of
Eclipse isn't at all similar to running it outside, if you have any
type of classloader issues, or multiple conflicting libraries.  Not
that there are any in-tree right now, but I can see how future ones
could exist.

7. Make each plugin its own deliverable (even if they are all
maintained inside one tree), including the ability to assemble a
".job" from the various internal and external components via Ant
Tasks, or Maven plugins.  It's very tedious to maintain anything
outside of the tree as of right now.

8. Try and reduce the number of "All or Nothing" decisions.  See
NUTCH-407: in order to deal with a straightforward problem, the
suggestion is to switch from Black List out to White List in.  That's
a really significant change to deal with a relatively simple problem.
Especially given that I listed a specific directory I wanted to have
indexed.  If I wanted a higher level to be indexed, I'd have listed
that higher level in my list of seeds.  There are other similar
issues, like the handling of ' ' in URLs.  I used the URL Filter
Normalizer to fix that, but I can't see how to apply it to http://,
but not file://.  The handling of the two is very different.  I had
similar problems with "mime.magic".  I can either have it on, or off.
It'd be great if there were much finer granularity there.  The patch
in NUTCH-407 was the first thing applied to my local branch, because
having to remember to modify the urlfilter configuration seems like a
real hassle when I just want to add a new directory.

9. Stop treating Windows as a second class citizen for local-only
configurations.  While it'll work with Cygwin, it is problematic for a
number of my deployments to have to install Cygwin.  It'd be great if
we could just write a small abstraction that does nothing on Windows
platforms.  In a non-distributed mode, you can write a no-op program
that does nothing; as long as it returns the proper exit code, it all
works fine.  It'd be great if at least that much could be made to
work in Windows without Cygwin.  If I remember right, there are 3-4
programs (chown, chgrp, whoami, groups?), I think.  They can do
nothing and return garbage and it'll all just work.  I did that as a
cheater's way of getting out of the Cygwin dependency.  I've run Linux
on my desktop for 9 of the last 10 years, but it's a simple fact that
to be adopted by many smaller deployments, Windows support is
critical.  I'm going to have to maintain that.  I'm sure others will
too; it'd be really nice if we could maintain just one copy of it in
the tree.
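
A minimal sketch of the kind of abstraction described in item 9,
written in Java and not taken from Nutch or Hadoop.  The class name
LocalShell and its methods are hypothetical, and the command list is
only the one recalled above.  It fakes a successful exit code and does
not emulate any output that a caller might try to parse.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: skips Unix permission commands on Windows. */
final class LocalShell {
    /** Commands treated as no-ops on Windows (assumed list, per item 9). */
    private static final List<String> UNIX_ONLY =
            Arrays.asList("chown", "chgrp", "whoami", "groups");

    private LocalShell() {}

    /** True if the given os.name value looks like a Windows platform. */
    static boolean isWindows(String osName) {
        return osName.toLowerCase(Locale.ROOT).startsWith("windows");
    }

    /** True if the command should be skipped instead of executed. */
    static boolean shouldSkip(String osName, String... command) {
        return isWindows(osName)
                && command.length > 0
                && UNIX_ONLY.contains(command[0]);
    }

    /** Runs the command and returns its exit code, or 0 if it was skipped. */
    static int run(String... command) throws IOException, InterruptedException {
        if (shouldSkip(System.getProperty("os.name"), command)) {
            return 0; // pretend success without launching anything
        }
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

With something like this, LocalShell.run("chown", "user", "file") would
return 0 on Windows without launching a process, and would run the real
command everywhere else.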

Anyways, from my perspective, I'd like to help contribute solutions to
various problems.  Nutch is a great concept, and so is much of the
implementation, but right now it's just really hard to use from the
outside.
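
To make the second approach from item 3 concrete, here is a minimal
Java sketch of a child-first class loader.  It is not the actual
PluginClassloader patch: the class name ChildFirstClassLoader and the
prefix parameter are assumptions for illustration.  It only shows the
general technique of looking in the plugin's own jars first for
selected packages and delegating to the parent for everything else.

```java
import java.net.URL;
import java.net.URLClassLoader;

/**
 * Hypothetical child-first loader: classes whose names start with one of
 * the given prefixes are searched in this loader's URLs before the parent.
 */
class ChildFirstClassLoader extends URLClassLoader {
    private final String[] childFirstPrefixes;

    ChildFirstClassLoader(URL[] urls, ClassLoader parent, String... childFirstPrefixes) {
        super(urls, parent);
        this.childFirstPrefixes = childFirstPrefixes.clone();
    }

    private boolean isChildFirst(String name) {
        for (String prefix : childFirstPrefixes) {
            if (name.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> loaded = findLoadedClass(name);
            if (loaded == null && isChildFirst(name)) {
                try {
                    // Look in the plugin's own jars before asking the parent.
                    loaded = findClass(name);
                } catch (ClassNotFoundException e) {
                    // Not bundled with the plugin; fall back to the parent below.
                }
            }
            if (loaded == null) {
                // Standard parent-first delegation for everything else.
                return super.loadClass(name, resolve);
            }
            if (resolve) {
                resolveClass(loaded);
            }
            return loaded;
        }
    }
}
```

A plugin loader built this way would be created with a prefix such as
"org.apache.tika." and the URLs of the jars in the plugin's lib
directory, so the plugin's copy of Tika would be preferred over any
copy visible to the parent.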

Kirby
