All good arguments, and as nobody else voiced the desire to have this other 
branch of Nutch I was rambling about, I'll consider this thread done.
Thanks for the explanations, Doug.

Otis

----- Original Message ----
From: Doug Cutting <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Monday, January 22, 2007 1:40:30 PM
Subject: Re: Reviving Nutch 0.7

[EMAIL PROTECTED] wrote:
> Yes, certainly, anything that can be shared and decoupled from pieces that 
> make each branch (not SVN/CVS branch) different, should be decoupled.  But I 
> was really curious about whether people think this is a valid idea/direction, 
> not necessarily immediately how things should be implemented.  In my mind, 
> one branch is the branch that runs on top of Hadoop, with NameNode, DataNode, 
> HDFS, etc.  That's the branch that's in the trunk.  The other branch is a 
> simpler branch without all that Hadoop stuff, for folks who need to fetch, 
> index, and search a few hundred thousand or a few million or even a few tens 
> of millions of pages, and don't need replication, etc. that comes with 
> Hadoop.  That branch could be based off of 0.7.  I also know that a lot of 
> people are trying to use Nutch to build vertical search engines, so there is 
> also a need for a focused fetcher.  Kelvin Tan brought this up a few times, 
> too, I believe.

Branching doesn't sound like the right solution here.

First, one doesn't need to run any Hadoop daemons to use Nutch: 
everything should run fine in a single process by default.  If there are 
bugs in this they should be logged, folks who care should submit 
high-quality, back-compatible, generally useful patches, and committers 
should work to get these patches committed to the trunk.

Second, if there are to be two modes of operation, wouldn't they best be 
developed in a common source tree, so that they share as much as 
possible and diverge as little as possible?  It seems to me that a good 
architecture would be to agree on a common high-level API, then use two 
different runtimes underneath, one to support distributed operation, and 
one to support standalone operation.  Hey!  That's what Hadoop already 
does!  Maybe it's not perfect and someone can propose a better way to 
share maximal amounts of code, but the code split should probably be 
into different classes and packages in a single source tree maintained 
by a single community of developers, not by branching a single source 
tree in a revision control and splitting the developers.
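To illustrate the point about Hadoop's two runtimes: the split between standalone and distributed operation is a matter of configuration, not of separate source trees. A minimal sketch, assuming Hadoop-0.x-era property names (treat the exact values shown as assumptions):

```xml
<!-- hadoop-site.xml sketch: leaving both properties at "local" runs
     everything in a single process with the local filesystem; pointing
     them at daemons switches the same code to distributed operation. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>local</value>        <!-- or e.g. hdfs://namenode:9000/ for HDFS -->
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>        <!-- or e.g. jobtracker:9001 for a cluster -->
  </property>
</configuration>
```

The same jobs, classes, and packages serve both modes; only the runtime underneath changes.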

Third, part of the problem seems to be that there are too few 
contributors--that the challenges are big and the resources limited. 
Splitting the project will only spread those resources more thinly.

What really is the issue here?  Are good patches languishing?  Are there 
patches that should be committed (meet coding standards, are 
back-compatible, generally useful, etc.) but are not?  A great patch is 
one that a committer can commit with few worries: it includes new 
unit tests, it passes all existing unit tests, it fixes one thing only, 
etc.  Such patches should not have to wait long for commit.  And once 
someone submits a few such patches, they should be invited to become 
a committer.

It sounds to me like the problem is that, off-the-shelf, Nutch does not 
yet solve all the problems folks would like it to: e.g., it has never 
done a good job with incremental indexing.  Folks see progress made on 
scalability, but really wish it were making more progress on 
incrementality or something else.  But it's not going to make progress 
on incrementality without someone doing the work.  A fork or a branch 
isn't going to do the work.  I don't see any reason that the work cannot 
be done right now.  It can be done incrementally: e.g., if the web db 
API seems inappropriate for incremental updates, then someone should 
submit a patch that provides an incremental web db API, updating the 
fetcher and indexer to use this.  A design for this on the wiki would be 
a good place to start.

Finally, web crawling, indexing and searching are data-intensive. 
Before long, users will want to index tens or hundreds of millions of 
pages.  Distributed operation is soon required at this scale, and 
batch-mode is an order of magnitude faster.  So be careful before you 
throw those features out: you might want them back soon.

Doug
