[EMAIL PROTECTED] wrote:
Yes, certainly, anything that can be shared should be decoupled from the pieces that make each branch (not an SVN/CVS branch) different. But I was really curious whether people think this is a valid idea/direction, not necessarily how things should be implemented right away. In my mind, one branch is the branch that runs on top of Hadoop, with NameNode, DataNode, HDFS, etc. That's the branch that's in the trunk. The other branch is a simpler branch without all that Hadoop stuff, for folks who need to fetch, index, and search a few hundred thousand, a few million, or even a few tens of millions of pages, and don't need the replication and other machinery that comes with Hadoop. That branch could be based on 0.7. I also know that a lot of people are trying to use Nutch to build vertical search engines, so there is also a need for a focused fetcher. Kelvin Tan brought this up a few times, too, I believe.

Branching doesn't sound like the right solution here.

First, one doesn't need to run any Hadoop daemons to use Nutch: everything should run fine in a single process by default. If there are bugs in this, they should be logged; folks who care should submit high-quality, back-compatible, generally useful patches; and committers should work to get those patches committed to the trunk.
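
To make that concrete: standalone operation is just Hadoop's local job runner and local filesystem, with no daemons involved. A minimal sketch (the property names are Hadoop's; treat the exact values as illustrative for your version):

    // Minimal sketch: force Hadoop into local, single-process mode.
    // No NameNode, DataNode, or JobTracker daemons are needed for this.
    import org.apache.hadoop.mapred.JobConf;

    public class LocalNutchExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.set("mapred.job.tracker", "local"); // run map/reduce in this JVM
        conf.set("fs.default.name", "file:///"); // use the local filesystem
        // ... construct a Nutch tool (e.g. a fetch or index job) with this
        // conf and run it; the job then executes entirely in-process.
      }
    }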

Second, if there are to be two modes of operation, wouldn't they best be developed in a common source tree, so that they share as much as possible and diverge as little as possible? It seems to me that a good architecture would be to agree on a common high-level API, then use two different runtimes underneath: one to support distributed operation, and one to support standalone operation. Hey! That's what Hadoop already does! Maybe it's not perfect, and someone can propose a better way to maximize shared code, but the split should probably be into different classes and packages within a single source tree maintained by a single community of developers, not a branch in revision control that splits the developers.
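
If a picture helps, here is a hypothetical sketch of that shape; none of these class names exist in Nutch, they're purely illustrative:

    // Hypothetical: one high-level API, two runtimes underneath.
    // (Each class would live in its own file/package.)
    public interface FetchRunner {
      void fetch(String segmentDir) throws Exception;
    }

    // Standalone runtime: does the work in a single process.
    class LocalFetchRunner implements FetchRunner {
      public void fetch(String segmentDir) throws Exception {
        // loop over the fetch list and fetch pages in this JVM
      }
    }

    // Distributed runtime: submits the same work as a MapReduce job.
    class DistributedFetchRunner implements FetchRunner {
      public void fetch(String segmentDir) throws Exception {
        // build a job configuration for the fetch and submit it to the cluster
      }
    }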

Third, part of the problem seems to be that there are too few contributors: the challenges are big and the resources limited. Splitting the project will only spread those resources more thinly.

What really is the issue here? Are good patches languishing? Are there patches that should be committed (they meet coding standards, are back-compatible, generally useful, etc.) but are not? A great patch is one that a committer can commit with few worries: it includes new unit tests, it passes all existing unit tests, it fixes one thing only, etc. Such patches should not have to wait long for commit. And once someone submits a few such patches, they should be invited to become a committer.
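
The unit test in such a patch can be tiny. A sketch of the shape, with hypothetical names:

    import junit.framework.TestCase;

    // Hypothetical example of the kind of focused test a patch should carry:
    // it exercises exactly the one behavior the patch changes, nothing more.
    public class TestTrailingSlash extends TestCase {
      public void testTrailingSlashAdded() {
        assertEquals("http://example.com/", addTrailingSlash("http://example.com"));
        assertEquals("http://example.com/", addTrailingSlash("http://example.com/"));
      }
      private static String addTrailingSlash(String url) {
        return url.endsWith("/") ? url : url + "/";
      }
    }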

It sounds to me like the problem is that, off-the-shelf, Nutch does not yet solve all the problems folks would like it to: e.g., it has never done a good job with incremental indexing. Folks see progress being made on scalability, but really wish more progress were being made on incrementality or something else. But no progress will be made on incrementality without someone doing the work, and a fork or a branch isn't going to do that work. I don't see any reason the work cannot be done right now. It can be done incrementally: e.g., if the web db API seems inappropriate for incremental updates, then someone should submit a patch that provides an incremental web db API, updating the fetcher and indexer to use it. A design for this on the wiki would be a good place to start.
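
To seed such a design, a hypothetical incremental web db API might start as small as the following; every name here is invented, just to show the scale of the first step:

    import java.io.IOException;
    import java.util.Iterator;

    // Hypothetical sketch of an incremental web db API (all names invented).
    public interface IncrementalWebDB {
      /** Add or update a single page without a full batch rebuild. */
      void updatePage(String url, long fetchTime) throws IOException;

      /** Remove a page, e.g. after a permanent 404. */
      void deletePage(String url) throws IOException;

      /** URLs due for (re)fetching before the given time, for the fetcher. */
      Iterator<String> pagesDueBefore(long time) throws IOException;
    }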

Finally, web crawling, indexing, and searching are data-intensive. Before long, users will want to index tens or hundreds of millions of pages. At that scale distributed operation quickly becomes required, and batch-mode operation is an order of magnitude faster. So be careful before you throw those features out: you might want them back soon.

Doug

