[EMAIL PROTECTED] wrote:
Yes, certainly, anything that can be shared and decoupled from pieces that make
each branch (not SVN/CVS branch) different, should be decoupled. But I was
really curious about whether people think this is a valid idea/direction, not
necessarily immediately how things should be implemented. In my mind, one
branch is the branch that runs on top of Hadoop, with NameNode, DataNode, HDFS,
etc. That's the branch that's in the trunk. The other branch is a simpler
branch without all that Hadoop stuff, for folks who need to fetch, index, and
search a few hundred thousand or a few million or even a few tens of millions
of pages, and don't need replication, etc. that comes with Hadoop. That branch
could be based off of 0.7. I also know that a lot of people are trying to use
Nutch to build vertical search engines, so there is also a need for a focused
fetcher. Kelvin Tan brought this up a few times, too, I believe.
Branching doesn't sound like the right solution here.
First, one doesn't need to run any Hadoop daemons to use Nutch:
everything should run fine in a single process by default. If there are
bugs in this they should be logged, folks who care should submit
high-quality, back-compatible, generally useful patches, and committers
should work to get these patches committed to the trunk.
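The single-process default Doug describes corresponds to Hadoop's local mode. A sketch of the relevant hadoop-site.xml settings (property names as in Hadoop 0.x; to my knowledge both values were already the defaults at the time, so a stock install runs without any daemons):

```xml
<!-- hadoop-site.xml: run Nutch in a single process, no Hadoop daemons.
     A sketch only; property names as in contemporary Hadoop 0.x. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>local</value>
    <!-- use the local file system: no NameNode or DataNode needed -->
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
    <!-- run MapReduce jobs in-process: no JobTracker or TaskTracker -->
  </property>
</configuration>
```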
Second, if there are to be two modes of operation, wouldn't they best be
developed in a common source tree, so that they share as much as
possible and diverge as little as possible? It seems to me that a good
architecture would be to agree on a common high-level API, then use two
different runtimes underneath, one to support distributed operation, and
one to support standalone operation. Hey! That's what Hadoop already
does! Maybe it's not perfect and someone can propose a better way to
share maximal amounts of code, but the code split should probably be
into different classes and packages in a single source tree maintained
by a single community of developers, not by branching a single source
tree in a revision control and splitting the developers.
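The "common high-level API, two runtimes underneath" idea can be made concrete with a small sketch. All names below are illustrative, not Nutch's or Hadoop's actual API; the pattern mirrors how Hadoop selects a local versus distributed implementation from configuration.

```java
// Hypothetical sketch: one high-level interface, two runtimes beneath it,
// all living in the same source tree. Names are illustrative only.
interface CrawlRuntime {
    String describe();
}

class StandaloneRuntime implements CrawlRuntime {
    public String describe() { return "single-process, local file system"; }
}

class DistributedRuntime implements CrawlRuntime {
    public String describe() { return "MapReduce jobs over a distributed file system"; }
}

class RuntimeFactory {
    // Mirrors Hadoop's convention: the value "local" selects in-process
    // operation, anything else is treated as a cluster address.
    static CrawlRuntime get(String mode) {
        return "local".equals(mode) ? new StandaloneRuntime()
                                    : new DistributedRuntime();
    }
}
```

With this shape, the fetcher and indexer code against `CrawlRuntime` is shared; only the two implementations diverge, which is the opposite of what a source-tree branch produces.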
Third, part of the problem seems to be that there are too few
contributors--that the challenges are big and the resources limited.
Splitting the project will only spread those resources more thinly.
What really is the issue here? Are good patches languishing? Are there
patches that should be committed (meet coding standards, are
back-compatible, generally useful, etc.) but are not? A great patch is
one that a committer can commit with few worries: it includes new
unit tests, it passes all existing unit tests, it fixes one thing only,
etc. Such patches should not have to wait long for commit. And once
someone submits a few such patches, then one should be invited to become
a committer.
It sounds to me like the problem is that, off-the-shelf, Nutch does not
yet solve all the problems folks would like it to: e.g., it has never
done a good job with incremental indexing. Folks see progress made on
scalability, but really wish it were making more progress on
incrementality or something else. But it's not going to make progress
on incrementality without someone doing the work. A fork or a branch
isn't going to do the work. I don't see any reason that the work cannot
be done right now. It can be done incrementally: e.g., if the web db
API seems inappropriate for incremental updates, then someone should
submit a patch that provides an incremental web db API, updating the
fetcher and indexer to use this. A design for this on the wiki would be
a good place to start.
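To make the suggested patch shape concrete, here is a purely hypothetical sketch of what an incremental web db API might look like, with a toy in-memory implementation. None of these names exist in Nutch; they only illustrate the kind of interface such a design document could propose.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical incremental web db API: pages can be added or refreshed one
// at a time, and the fetcher can ask which pages are due, with no full
// batch rebuild of the db. Names are illustrative only.
interface IncrementalWebDB {
    void addPage(String url, long nextFetchTime);  // stage one new page
    List<String> pagesDueForFetch(long now);       // pages whose refetch time has passed
}

// Toy in-memory implementation, just to make the interface runnable.
class InMemoryWebDB implements IncrementalWebDB {
    private final Map<String, Long> nextFetch = new HashMap<>();

    public void addPage(String url, long when) {
        nextFetch.put(url, when);
    }

    public List<String> pagesDueForFetch(long now) {
        List<String> due = new ArrayList<>();
        for (Map.Entry<String, Long> e : nextFetch.entrySet())
            if (e.getValue() <= now)
                due.add(e.getKey());
        return due;
    }
}
```

A real patch would of course back this with Nutch's on-disk structures and update the fetcher and indexer to consume it; the sketch only shows the contract.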
Finally, web crawling, indexing and searching are data-intensive.
Before long, users will want to index tens or hundreds of millions of
pages. Distributed operation is soon required at this scale, and
batch-mode is an order of magnitude faster. So be careful before you
throw those features out: you might want them back soon.
Doug