On 1/22/07, Doug Cutting <[EMAIL PROTECTED]> wrote:


Finally, web crawling, indexing and searching are data-intensive.
Before long, users will want to index tens or hundreds of millions of
pages.  Distributed operation is soon required at this scale, and
batch-mode is an order-of-magnitude faster.  So be careful before you
throw those features out: you might want them back soon.

Doug


As a developer building applications on top of Nutch, my experience is that
I can't go back to version 0.7x because the features in version 0.8/0.9 are
so much needed, even for non-distributed crawling/indexing. For example, I
can run crawling/indexing on a Linux server and a Windows laptop separately,
and merge the newly crawled databases into the main crawldb. As I remember,
v0.7 can't merge separate crawldbs without lots of customization.

It may take some time to switch from 0.7x to v0.8/0.9, especially if you
have lots of customization code. But once you get over this one hurdle, you
will enjoy the new and better features in version 0.8/0.9.  Also, this may
be the time to rethink the design of your application. For my own project,
I always try to separate my code from the Nutch core code as much as possible
so that I can easily upgrade the application to keep up with new Nutch
releases. Staying away from the newest Nutch version seems somewhat backward
to me.

AJ
--
AJ Chen, PhD
Palo Alto, CA
http://web2express.org
