Hello, I'm writing this on behalf of both Armel Nene and myself.
We think that you and those who have responded have a point. We've been experiencing quite a number of problems getting Nutch 0.8 adapted to our needs, and making changes to support evolving business requirements as they come up. So much so that we've considered replacing the "spine" of Nutch with our own programs, which would still be compatible with the Nutch plugins (same parameters etc.), but would make it easier for us to make changes and to debug.

We've decided to lay out some of our challenges for you to consider.

Our major need is the ability to deploy on large enterprise file systems (1-10 Terabytes, large compared to average file systems, but small compared to the WWW). We also need to support http, but only for specific web sites, subscription web sites and so on. We don't need to replicate a generic Google-style implementation.

The main features we are currently working on relate primarily to near-real-time crawling, specifically:

- Incremental Crawling, where changes are monitored at the folder level, which is much faster than fetching every URL and checking for a change. Note that this is similar to adaptive crawling, but will be even more efficient.
- Special handling for parsing of large files (possibly farming those out to dedicated processors à la Amazon). Hadoop would be useful here, but we would consider re-adding this at a later stage.
- Incremental Indexing, where documents are added to or removed from a live index, instead of rebuilding a new index each time.

We would be happy to join a group of 0.7 developers, if that would enable us to pursue this enterprise-based direction, which clearly has different challenges than those facing WWW-crawling.
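To make the folder-level idea behind Incremental Crawling a little more concrete, here is a minimal sketch in Python. It is only an illustration under our own assumptions, not how Nutch or our code does it: the function names (`folder_fingerprint`, `snapshot`, `changed_folders`) are hypothetical, and the fingerprint is simply the name, size and modification time of each file directly inside a folder. We stat the files rather than rely on the folder's own modification time, because on many file systems that timestamp does not change when an existing file is edited in place.

```python
import os


def folder_fingerprint(path):
    """Return a comparable summary of the files directly inside `path`.

    Hypothetical helper: the fingerprint is a sorted tuple of
    (name, size, mtime_ns) for each regular file. No file content is read.
    """
    entries = []
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_file(follow_symlinks=False):
                st = entry.stat(follow_symlinks=False)
                entries.append((entry.name, st.st_size, st.st_mtime_ns))
    return tuple(sorted(entries))


def snapshot(root):
    """Map every folder under `root` (including `root`) to its fingerprint."""
    return {dirpath: folder_fingerprint(dirpath)
            for dirpath, _dirs, _files in os.walk(root)}


def changed_folders(old, new):
    """Compare two snapshots.

    Returns (changed, removed): folders that are new or whose fingerprint
    differs, and folders that no longer exist.
    """
    changed = {p for p, fp in new.items() if old.get(p) != fp}
    removed = set(old) - set(new)
    return changed, removed
```

Only the folders returned in `changed` would need to be handed to the fetcher and parser again; folders in `removed` would have their documents dropped from the index. Unchanged folders cost a directory listing and one stat per file, with no fetching or parsing.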
Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions
http://blog.idna-solutions.com

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: 22 January 2007 06:48
To: Nutch Developer List
Subject: Reviving Nutch 0.7

Hi,

I've been meaning to write this message for a while, and Andrzej's StrategicGoals made me compose it, finally.

Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes, it will be even more valuable than it is today. However, I think there is still a need for something much simpler, something like what Nutch 0.7 used to be. Fairly regular nutch-user inquiries confirm this. Nutch has too few developers to maintain and further develop both of these concepts, and the main Nutch developers need the more powerful version - 0.8 and beyond. So, what is going to happen to 0.7? Maintenance mode?

I feel that there is enough need for 0.7-style Nutch that it might be worth at least considering and discussing the possibility of somehow branching that version into a parallel project that's not just in a maintenance mode, but has its own group of developers (not me, no time :( ) that pushes it forward.

Thoughts?

Otis