[EMAIL PROTECTED] wrote:
Yes, certainly, anything that can be shared should be decoupled from the pieces that make each branch (not an SVN/CVS branch) different. But I was really curious whether people think this is a valid idea/direction, not necessarily how things should be implemented right away. In my mind, one branch is the branch that runs on top of Hadoop, with NameNode, DataNode, HDFS, etc. That's the branch that's in the trunk. The other branch is a simpler branch without all that Hadoop stuff, for folks who need to fetch, index, and search a few hundred thousand, a few million, or even a few tens of millions of pages, and don't need the replication and other machinery that comes with Hadoop. That branch could be based on 0.7. I also know that a lot of people are trying to use Nutch to build vertical search engines, so there is also a need for a focused fetcher. Kelvin Tan brought this up a few times, too, I believe.

Branching doesn't sound like the right solution here.

First, one doesn't need to run any Hadoop daemons to use Nutch: everything should run fine in a single process by default. If there are bugs in this, they should be logged; folks who care should submit high-quality, back-compatible, generally useful patches; and committers should work to get those patches committed to the trunk.
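
To make that concrete: standalone operation is just Hadoop's local job runner and local filesystem, with no daemons involved. A minimal sketch (the property names are Hadoop's; treat the exact values as illustrative for your version):

    // Minimal sketch: force Hadoop into local, single-process mode.
    // No NameNode, DataNode, or JobTracker daemons are needed for this.
    import org.apache.hadoop.mapred.JobConf;

    public class LocalNutchExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.set("mapred.job.tracker", "local"); // run map/reduce in this JVM
        conf.set("fs.default.name", "file:///"); // use the local filesystem
        // ... construct a Nutch tool (e.g. a fetch or index job) with this
        // conf and run it; the job then executes entirely in-process.
      }
    }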

Second, if there are to be two modes of operation, wouldn't they best be developed in a common source tree, so that they share as much as possible and diverge as little as possible? It seems to me that a good architecture would be to agree on a common high-level API, then use two different runtimes underneath: one to support distributed operation, and one to support standalone operation. Hey! That's what Hadoop already does! Maybe it's not perfect, and someone can propose a better way to maximize shared code, but the split should probably be into different classes and packages within a single source tree maintained by a single community of developers, not a branch in revision control that splits the developers.
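
If a picture helps, here is a hypothetical sketch of that shape; none of these class names exist in Nutch, they're purely illustrative:

    // Hypothetical: one high-level API, two runtimes underneath.
    // (Each class would live in its own file/package.)
    public interface FetchRunner {
      void fetch(String segmentDir) throws Exception;
    }

    // Standalone runtime: does the work in a single process.
    class LocalFetchRunner implements FetchRunner {
      public void fetch(String segmentDir) throws Exception {
        // loop over the fetch list and fetch pages in this JVM
      }
    }

    // Distributed runtime: submits the same work as a MapReduce job.
    class DistributedFetchRunner implements FetchRunner {
      public void fetch(String segmentDir) throws Exception {
        // build a job configuration for the fetch and submit it to the cluster
      }
    }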

Third, part of the problem seems to be that there are too few contributors: the challenges are big and the resources limited. Splitting the project will only spread those resources more thinly.

What really is the issue here? Are good patches languishing? Are there patches that should be committed (they meet coding standards, are back-compatible, generally useful, etc.) but are not? A great patch is one that a committer can commit with few worries: it includes new unit tests, it passes all existing unit tests, it fixes one thing only, etc. Such patches should not have to wait long for commit. And once someone submits a few such patches, they should be invited to become a committer.
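
The unit test in such a patch can be tiny. A sketch of the shape, with hypothetical names:

    import junit.framework.TestCase;

    // Hypothetical example of the kind of focused test a patch should carry:
    // it exercises exactly the one behavior the patch changes, nothing more.
    public class TestTrailingSlash extends TestCase {
      public void testTrailingSlashAdded() {
        assertEquals("http://example.com/", addTrailingSlash("http://example.com"));
        assertEquals("http://example.com/", addTrailingSlash("http://example.com/"));
      }
      private static String addTrailingSlash(String url) {
        return url.endsWith("/") ? url : url + "/";
      }
    }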

It sounds to me like the problem is that, off-the-shelf, Nutch does not yet solve all the problems folks would like it to: e.g., it has never done a good job with incremental indexing. Folks see progress being made on scalability, but really wish more progress were being made on incrementality or something else. But no progress will be made on incrementality without someone doing the work, and a fork or a branch isn't going to do that work. I don't see any reason the work cannot be done right now. It can be done incrementally: e.g., if the web db API seems inappropriate for incremental updates, then someone should submit a patch that provides an incremental web db API, updating the fetcher and indexer to use it. A design for this on the wiki would be a good place to start.
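
To seed such a design, a hypothetical incremental web db API might start as small as the following; every name here is invented, just to show the scale of the first step:

    import java.io.IOException;
    import java.util.Iterator;

    // Hypothetical sketch of an incremental web db API (all names invented).
    public interface IncrementalWebDB {
      /** Add or update a single page without a full batch rebuild. */
      void updatePage(String url, long fetchTime) throws IOException;

      /** Remove a page, e.g. after a permanent 404. */
      void deletePage(String url) throws IOException;

      /** URLs due for (re)fetching before the given time, for the fetcher. */
      Iterator<String> pagesDueBefore(long time) throws IOException;
    }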

Finally, web crawling, indexing, and searching are data-intensive. Before long, users will want to index tens or hundreds of millions of pages. At that scale distributed operation quickly becomes required, and batch-mode operation is an order of magnitude faster. So be careful before you throw those features out: you might want them back soon.

Doug

