Community Comments

Lewis John Mcgibbney Tue, 15 Nov 2011 12:50:11 -0800

Hi Guys,

During ApacheCon I made a point of trying to gauge how people who use
Nutch find it. From the outset I would like to say that my reasoning
behind this exercise was not to pick holes in the work that we put into
the project as a community. The great ideas and improvements, and the
Apache product which we subsequently develop and maintain, make for a
fantastic piece of software. I thought it could benefit us if we could get
at least a few comments regarding users' experience. Here's one for starters:
-------------------------------------------
Hi Lewis,


Thank you for contacting us regarding Apache Nutch. Yes, we have been
using Nutch for web crawling, and thank you for making it possible! We
will gladly share our opinions and comments with you. Here are several
items that we like and some that we would like to see addressed in future
Nutch development.

What we like about Nutch:

1. Open source, Apache license
2. Integrates with Solr
3. Modular architecture; we are a development shop and value the
extensibility the most
4. Plans for 2.0 to remove search and index from Nutch and only focus on
crawling

What we do not like about Nutch:

1. Lack of incremental index update, needs twice the storage to build a
new index (will go away in 2.0)
2. Integration with Hadoop FS; it takes a disproportionately large amount
of space to do segment merging or indexing
3. Unstable, out of memory exceptions on large crawls during segment
merging or indexing, worker threads hang occasionally
4. Lack of GUI/web management/reporting

We hope our comments will help you to continue making Nutch an even better
Web crawler.
-------------------------------------------------------------------
Any comments, guys? I've already explained to him that point 4 in his
first list has been fully addressed from 1.3 onwards. I am curious to get
your opinions on the rest of the stuff (over and above the obvious GUI/web
management/reporting item).

Thank you.

-- 
*Lewis*
