Hello,

I'm writing this on behalf of both Armel Nene and myself. 

We think that you and those who have responded have a point.  We've run
into quite a number of problems adapting Nutch 0.8 to our needs, and in
changing it to support evolving business requirements as they come up.

So much so that we've considered replacing the "spine" of Nutch with our
own programs, which would remain compatible with the Nutch plugins (same
parameters etc.) but would make it easier for us to make changes and to
debug.  We've decided to lay out some of our challenges for you to consider.
 
Our major need is the ability to deploy on large enterprise file systems
(1-10 Terabytes: large compared to average file systems, but small compared
to the WWW).  We also need to support http, but only for specific web sites,
subscription web sites and so on.  We don't need to replicate a generic,
Google-style implementation.

The main features we are currently working on relate primarily to
near-real-time crawling, specifically:
- Incremental Crawling, where changes are monitored at the folder level,
which is much faster than fetching every URL and checking for a change.
Note that this is similar to adaptive crawling, but will be even more
efficient.
- Special handling for parsing of large files (possibly farming those out to
dedicated processors, à la Amazon).  Hadoop would be useful here, but we
would consider re-adding it at a later stage.
- Incremental Indexing, where documents are added to or removed from a live
index, instead of rebuilding a new index each time.
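
To make the folder-level idea concrete, here is a rough sketch (not our
actual implementation, and not Nutch code).  It is in Python for brevity, and
the function names snapshot_dirs, changed_dirs and removed_dirs are
hypothetical.  It assumes the file system updates a folder's modification
time when entries are added, removed or renamed; a file edited in place
usually does not touch the folder's time, so a real crawler would still need
to stat the files themselves, which remains far cheaper than fetching them.

```python
import os


def snapshot_dirs(root):
    """Map every directory under root to its modification time in ns."""
    return {
        dirpath: os.stat(dirpath).st_mtime_ns
        for dirpath, _dirnames, _filenames in os.walk(root)
    }


def changed_dirs(previous, current):
    """Directories that are new, or whose mtime differs from last time."""
    return sorted(
        d for d, mtime in current.items() if previous.get(d) != mtime
    )


def removed_dirs(previous, current):
    """Directories seen in the previous snapshot that no longer exist."""
    return sorted(d for d in previous if d not in current)
```

A crawl cycle would take a new snapshot, re-examine only the folders returned
by changed_dirs, drop documents under removed_dirs, and keep the new snapshot
for the next cycle.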

We would be happy to join a group of 0.7 developers, if that would enable us
to pursue this enterprise-based direction, which clearly has different
challenges than those facing WWW-crawling.

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions
http://blog.idna-solutions.com

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: 22 January 2007 06:48
To: Nutch Developer List
Subject: Reviving Nutch 0.7

Hi,

I've been meaning to write this message for a while, and Andrzej's
StrategicGoals made me compose it, finally.

Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop
stabilizes, it will be even more valuable than it is today.  However, I
think there is still a need for something much simpler, something like what
Nutch 0.7 used to be.  Fairly regular nutch-user inquiries confirm this.
Nutch has too few developers to maintain and further develop both of these
concepts, and the main Nutch developers need the more powerful version - 0.8
and beyond.  So, what is going to happen to 0.7?  Maintenance mode?

I feel that there is enough need for 0.7-style Nutch that it might be worth
at least considering and discussing the possibility of somehow branching
that version into a parallel project that's not just in a maintenance mode,
but has its own group of developers (not me, no time :( ) that pushes it
forward.

Thoughts?

Otis



