Re: Community Comments

Markus Jelsma Tue, 15 Nov 2011 13:02:15 -0800

> Hi Guys,
> 
> During ApacheCon I made a point of trying to gauge how people who used
> Nutch found it. From the outset I would like to say that my reasoning
> behind this exercise was not to pick holes in the work that we put into
> the project as a community. With its great ideas and improvements, the
> Apache product which we develop and maintain is a fantastic piece of
> software. I thought it could benefit us if we could get at least a few
> comments regarding users' experience. Here's one for starters:
> -------------------------------------------
> Hi Lewis,
> 
> Thank you for contacting us regarding Apache Nutch. Yes, we have been
> using Nutch for web crawling, and thank you for making it possible! We
> will gladly share our opinions and comments with you. Here are several
> items that we like and some that we would like to see addressed in future
> Nutch development.
> 
> What we like about Nutch:
> 
> 1. Open source, Apache license
> 2. Integrates with Solr
> 3. Modular architecture, we are a development shop and value the
> extendability the most
> 4. Plans for 2.0 to remove search and index from Nutch and only focus on
> crawling


Clearly good points indeed.

> 
> What we do not like about Nutch:
> 
> 1. Lack of incremental index update, needs twice the storage to build a
> new index (will go away in 2.0)

I'm not sure what he/she means. The index is in Solr. Perhaps he/she works
with an old version of Nutch?

> 2. Integration with Hadoop FS, it takes disproportional/large amount of
> space to do segment merging or indexing

Seems like old Nutch indeed, with embedded Lucene. Segment merging is not
something that is required anymore, but it may be useful from a maintenance
point of view, not for daily operations.

> 3. Unstable, out of memory exceptions on large crawls during segment
> merging or indexing, worker threads hang occasionally

OOMs are indeed a possibility; we also sometimes suffer from them. However,
if you calculate for the worst-case scenario you will most likely never run
OOM during fetch, parse or indexing. We rely on a good distribution of pages
and our average heap consumption is just right, except once in a while ;)

The problem is that handling and recovering from OOM is extremely difficult if 
not impossible.
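
As a rough illustration of the worst-case calculation mentioned above, here is
a minimal sketch. It assumes that each concurrent worker may hold one document
of the maximum permitted size, multiplied by an overhead factor for parsed
copies and buffers. The function name, the parameters and the numbers are all
hypothetical, chosen only for illustration; they are not Nutch settings or
measurements.

```python
import math


def worst_case_heap_bytes(workers, max_doc_bytes, overhead_factor):
    """Estimate peak heap use under an assumed, simplified model.

    Assumption (not taken from Nutch itself): every worker holds one
    document of the maximum size at the same moment, and each document
    costs `overhead_factor` times its raw size in memory.
    """
    return math.ceil(workers * max_doc_bytes * overhead_factor)


if __name__ == "__main__":
    # Hypothetical inputs: 50 workers, 1 MiB documents, 3x overhead.
    estimate = worst_case_heap_bytes(
        workers=50,
        max_doc_bytes=1024 * 1024,
        overhead_factor=3.0,
    )
    print(f"{estimate / (1024 * 1024):.0f} MiB")
```

A heap sized comfortably above an estimate of this kind should rarely run out,
but the model ignores everything else living in the JVM, so treat it as a
starting point only.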

> 4. Lack of GUI/web management/reporting

Well, I never have seen and still don't see any useful case for a GUI. It's a
complex package of many jobs. What would one want to manage through a GUI?

> 
> We hope our comments will help you to continue making Nutch an even better
> Web crawler.

Interesting, I'd like to hear more if there is any.

Thanks

> -------------------------------------------------------------------
> Any comments, guys? I've already explained to the guy that his first point
> 4 has been fully addressed in 1.3 onwards. I am curious to get your
> opinions on the rest of the stuff (over and above the obvious GUI/web
> management/reporting stuff).
> 
> Thank you.
