Hi Guys,

During ApacheCon I made a point of trying to gauge how people who use Nutch find it. From the outset I would like to say that my reasoning behind this exercise was not to pick holes in the work that we put into the project as a community. The great ideas, the improvements and, subsequently, the Apache product which we develop and maintain add up to a fantastic piece of software. I thought it could benefit us if we could get at least a few comments regarding users' experience. Here's one for starters:

-------------------------------------------

Hi Lewis,

Thank you for contacting us regarding Apache Nutch. Yes, we have been using Nutch for web crawling, and thank you for making it possible! We will gladly share our opinions and comments with you. Here are several items that we like and some that we would like to see addressed in future Nutch development.

What we like about Nutch:

1. Open source, Apache license
2. Integrates with Solr
3. Modular architecture; we are a development shop and value the extensibility the most
4. Plans for 2.0 to remove search and index from Nutch and only focus on crawling

What we do not like about Nutch:

1. Lack of incremental index update; needs twice the storage to build a new index (will go away in 2.0)
2. Integration with Hadoop FS; it takes a disproportionately large amount of space to do segment merging or indexing
3. Unstable; out of memory exceptions on large crawls during segment merging or indexing, and worker threads hang occasionally
4. Lack of GUI/web management/reporting

We hope our comments will help you to continue making Nutch an even better Web crawler.

-------------------------------------------------------------------

Any comments, guys? I've already explained to him that his first point 4 has been fully addressed from 1.3 onwards. I am curious to get your opinions on the rest of the points, over and above the obvious GUI/web management/reporting one.

Thank you.

-- 
*Lewis*

