+1 to the GUI comment. Even though I haven't made one yet, it's definitely on my list of items, should I find the cycles to do more besides releasing.
Thanks!

Cheers,
Chris

On Nov 15, 2011, at 1:01 PM, Markus Jelsma wrote:

>> Hi Guys,
>>
>> During ApacheCon I made a point of trying to gauge how people that used
>> Nutch found it. From the outset I would like to say that my reasoning
>> behind this exercise was not to pick holes in the work that we put in to
>> the project as a community, the great ideas, improvements and subsequently
>> Apache product which we develop and maintain is a fantastic piece of
>> software. I thought it could benefit us if we could get at least a few
>> comments regarding users' experience. Here's one for starters
>> -------------------------------------------
>> Hi Lewis,
>>
>> Thank you for contacting us regarding Apache Nutch. Yes, we have been
>> using Nutch for web crawling, and thank you for making it possible! We
>> will gladly share our opinions and comments with you. Here is several
>> items that we like and some that we would like to see addressed in future
>> Nutch development.
>>
>> What we like about Nutch:
>>
>> 1. Open source, Apache license
>> 2. Integrates with Solr
>> 3. Modular architecture, we are a development shop and value the
>> extendability the most
>> 4. Plans for 2.0 to remove search and index from Nutch and only focus on
>> crawling
>
> Clearly good points indeed.
>
>>
>> What we do not like about Nutch:
>>
>> 1. Lack of incremental index update, needs twice the storage to build a
>> new index (will go away in 2.0)
>
> I'm not sure what he/she means. The index is in Solr. Perhaps he/she works
> with old Nutch?
>
>> 2. Integration with Hadoop FS, it takes disproportional/large amount of
>> space to do segment merging or indexing
>
> Seems like old Nutch indeed with embedded Lucene. Segments merging is not
> something that is required anymore but may be useful from a maintenance
> point of view, not for daily operations.
>
>> 3. Unstable, out of memory exceptions on large crawls during segment
>> merging or indexing, worker threads hang occasionally
>
> OOM's are indeed a possibility, we also sometimes suffer from this. However,
> if one calculates worst case scenario you will most likely never run OOM
> during fetch, parse or indexing. We rely on good distribution of pages and
> our average heap consumption is just right, except once in a while ;)
>
> The problem is that handling and recovering from OOM is extremely difficult
> if not impossible.
>
>> 4. Lack of GUI/web management/reporting
>
> Well, i never have and still don't see any useful case for some GUI. It's a
> complex package of many jobs. What would one want to manage through a GUI?
>
>>
>> We hope our comments will help you to continue making Nutch an even better
>> Web crawler.
>
> Interesting, i'd like to hear more if there is any.
>
> Thanks
>
>> -------------------------------------------------------------------
>> Any comments guys? I've already explained to the guy that his first point
>> 4. has been fully addressed in 1.3 onwards. I am curious to get you guys'
>> opinions on the rest of the stuff (over and above the obvious GUI/web
>> management/reporting) stuff.
>>
>> Thank you.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

