Hi Matthew,

Thanks for your feedback. If you have any specific updates, improvements, or actionable items based on your comments below, we'd love to have you contribute them back to the community. Otherwise, we will take your feedback and add it to the queue of items in the Nutch issue tracking system for the project's committers to work on, as time permits.
Apache has a meritocracy process for contributing to projects and being recognized for those contributions. We welcome feedback and actionable items in the form of patches that improve the code, documentation, add new features, etc., while maintaining backwards compatibility with existing deployments and existing users.

Thanks, and I hope to see issues/feedback/patches continue to come!

Cheers,
Chris

http://www.apache.org/foundation/how-it-works.html#meritocracy

On 4/28/10 7:27 AM, "matthew a. grisius" <mgris...@comcast.net> wrote:

I also share many of Phil's sentiments. I really want the project (bin/nutch crawl) to work for me as well and I want to help somehow. I would like to share a 5 GB 'intranet' web site with ~50 people. And I have not graduated to making the 'deepcrawl' script work yet either, as I'm thinking that maybe Nutch might not be the 'right tool' for 'little projects' based on documentation, discussion list feedback, etc. . . .

-m.

On Wed, 2010-04-28 at 06:59 -0400, Phil Barnett wrote:
> On Mon, Apr 26, 2010 at 1:55 AM, Mattmann, Chris A (388J) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
>
> > Please vote on releasing these packages as Apache Nutch 1.1. The vote is
> > open for the next 72 hours.
>
> How do you test to see if Nutch works like the documentation says it works?
> I still find major differences between how it works and how the existing
> documentation tells me, a newcomer to the project, to get it running.
>
> For example, my find of broken code in bin/nutch crawl, a most basic way of
> getting it running.
>
> And I have yet to get the deepcrawl script working, which seems to be the
> suggested way to get beyond bin/nutch crawl. It doesn't return any data at
> all and has an error in the middle of its run regarding a missing file,
> which the last stage apparently failed to write (I believe because the
> scheduler excluded everything).
>
> I wonder if the developers have advanced so far past these basic scripts as
> to have pretty much left them behind. This leads to these basics that people
> start with not working.
>
> I've spent dozens of hours trying to get 1.1 to work anything like 1.0 and
> I'm getting nowhere at all. It's pretty frustrating to spend that much time
> trying to figure out how it works and keep hitting walls. And then asking
> basic questions here that go unanswered.
>
> The view from the outside is not so good from my direction. If you don't
> keep documentation up to date and you change the way things work, the
> project, as seen from the outside, is plainly broken.
>
> I'd be happy to give you feedback on where I find these problems and I'll
> even donate whatever fixes I can come up with, but Java is not a language
> I'm familiar with and going is slow weeding through things. I really need
> this project to work for me. I want to help.
>
> 1. Where is the scheduler documented? If I want to crawl everything from
> scratch, where is the information from the last run stored? It seems like
> my crawl is ignoring pages because the scheduler is knocking them out. It's
> not obvious to me why this is happening and how to stop it from happening.
> I think right now this is my major roadblock in getting bin/nutch crawl
> working. Maybe the scheduler code no longer works properly in bin/nutch
> crawl. I can't tell if it's that or if the default configurations don't
> work.
>
> 2. Where are the control files in conf documented? How do I know which ones
> do what and when? There's a half dozen *-urlfilters. Why?
>
> 3. Why doesn't your post nightly compile tests include bin/nutch crawl or,
> if it does, why didn't it find the error that stopped it from running?
>
> 4. Where is the documentation on how to configure the new Tika parser in
> your environment? I see that the old parsers have been removed by default,
> but there's nothing that shows me how to include/exclude document types.
>
> I believe your assessment of 'ready' is not inclusive of some very important
> things and that you would be doing a service to newcomers to bring
> documentation in line with current offerings. This is not trivial code and
> it takes a long time for someone from the outside to understand it. That
> process is being stifled on multiple fronts as far as I can see. Either that
> or I have missed an important document that exists and I haven't read it.
>
> Phil Barnett
> Senior Programmer / Analyst
> Walt Disney World, Inc.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++