+1 to the GUI comment. Even though I haven't made one yet, it's definitely on 
my list of items, should I find the cycles to do more besides releasing.

Thanks!

Cheers,
Chris

On Nov 15, 2011, at 1:01 PM, Markus Jelsma wrote:

>> Hi Guys,
>> 
>> During ApacheCon I made a point of trying to gauge how people who used
>> Nutch found it. From the outset I would like to say that my reasoning
>> behind this exercise was not to pick holes in the work that we put into
>> the project as a community. The great ideas, the improvements and,
>> subsequently, the Apache product which we develop and maintain add up to
>> a fantastic piece of software. I thought it could benefit us if we could
>> get at least a few comments regarding users' experience. Here's one for
>> starters:
>> -------------------------------------------
>> Hi Lewis,
>> 
>> Thank you for contacting us regarding Apache Nutch. Yes, we have been
>> using Nutch for web crawling, and thank you for making it possible! We
>> will gladly share our opinions and comments with you. Here are several
>> items that we like and some that we would like to see addressed in future
>> Nutch development.
>> 
>> What we like about Nutch:
>> 
>> 1. Open source, Apache license
>> 2. Integrates with Solr
>> 3. Modular architecture, we are a development shop and value the
>> extendability the most
>> 4. Plans for 2.0 to remove search and index from Nutch and only focus on
>> crawling
> 
> Clearly good points indeed.
> 
>> 
>> What we do not like about Nutch:
>> 
>> 1. Lack of incremental index update, needs twice the storage to build a
>> new index (will go away in 2.0)
> 
> I'm not sure what he/she means. The index is in Solr. Perhaps he/she works 
> with old Nutch?
> 
>> 2. Integration with Hadoop FS, it takes a disproportionately large amount
>> of space to do segment merging or indexing
> 
> Seems like old Nutch indeed, with embedded Lucene. Segment merging is not 
> something that is required anymore, but it may be useful from a maintenance 
> point of view, not for daily operations.
> 
>> 3. Unstable, out of memory exceptions on large crawls during segment
>> merging or indexing, worker threads hang occasionally
> 
> OOMs are indeed a possibility; we also sometimes suffer from this. However, 
> if one calculates the worst-case scenario, one will most likely never run out 
> of memory during fetch, parse or indexing. We rely on a good distribution of 
> pages, and our average heap consumption is just right, except once in a 
> while ;)
> 
> The problem is that handling and recovering from OOM is extremely difficult, 
> if not impossible.
> 
>> 4. Lack of GUI/web management/reporting
> 
> Well, I have never seen, and still don't see, any useful case for a GUI. 
> It's a complex package of many jobs. What would one want to manage through 
> a GUI?
> 
>> 
>> We hope our comments will help you to continue making Nutch an even better
>> Web crawler.
> 
> Interesting, I'd like to hear more if there is any.
> 
> Thanks
> 
>> -------------------------------------------------------------------
>> Any comments guys? I've already explained to the guy that his first point
>> 4 has been fully addressed in 1.3 onwards. I am curious to get your
>> opinions on the rest of the stuff (over and above the obvious GUI/web
>> management/reporting stuff).
>> 
>> Thank you.


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
