On Mon, Apr 26, 2010 at 1:55 AM, Mattmann, Chris A (388J) < chris.a.mattm...@jpl.nasa.gov> wrote:
> Please vote on releasing these packages as Apache Nutch 1.1. The vote is
> open for the next 72 hours.

How do you test to see whether Nutch works the way the documentation says it works? As a newcomer to the project, I still find major differences between what the existing documentation tells me and what it actually takes to get Nutch running. For example, there is my find of broken code in bin/nutch crawl, the most basic way of getting it running. And I have yet to get the deepcrawl script working, which seems to be the suggested way to get beyond bin/nutch crawl. It doesn't return any data at all, and it hits an error in the middle of its run about a missing file that the previous stage apparently failed to write (I believe because the scheduler excluded everything).

I wonder if the developers have advanced so far past these basic scripts as to have pretty much left them behind. This leads to the basics that people start with not working. I've spent dozens of hours trying to get 1.1 to work anything like 1.0, and I'm getting nowhere at all. It's pretty frustrating to spend that much time trying to figure out how it works and keep hitting walls, and then to ask basic questions here that go unanswered. The view from the outside is not so good from my direction. If you don't keep documentation up to date and you change the way things work, the project, as seen from the outside, is plainly broken.

I'd be happy to give you feedback on where I find these problems, and I'll even donate whatever fixes I can come up with, but Java is not a language I'm familiar with, so going is slow weeding through things. I really need this project to work for me. I want to help.

1. Where is the scheduler documented? If I want to crawl everything from scratch, where is the information from the last run stored? It seems like the schedule is telling my crawl to ignore pages, because the scheduler is knocking them out. It's not obvious to me why this is happening or how to stop it from happening.
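For the record, here is what I've been experimenting with on that front. My understanding (which may well be wrong, since I can't find it documented) is that the refetch schedule is driven by the db.fetch.interval.default property I found in conf/nutch-default.xml, so I've been trying to override it in conf/nutch-site.xml. Treat this as a guess, not documentation:

```xml
<!-- conf/nutch-site.xml -- my attempt at forcing pages to be re-fetched.
     I am guessing at the relevance of this property from reading
     conf/nutch-default.xml; please correct me if the scheduler is
     actually controlled somewhere else. -->
<property>
  <name>db.fetch.interval.default</name>
  <!-- the shipped default is 2592000 seconds (30 days); a small value
       should, if I understand it, make every page eligible for
       fetching again on the next run -->
  <value>60</value>
</property>
```

Alternatively, deleting the whole crawl output directory before re-running bin/nutch crawl should discard the old crawldb entirely, if I understand the on-disk layout correctly.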
I think right now this is my major roadblock in getting bin/nutch crawl working. Maybe the scheduler code no longer works properly in bin/nutch crawl; I can't tell whether it's that or whether the default configurations don't work.

2. Where are the control files in conf documented? How do I know which ones do what, and when? There's a half dozen *-urlfilters. Why?

3. Why don't your post-nightly-compile tests include bin/nutch crawl? Or, if they do, why didn't they find the error that stopped it from running?

4. Where is the documentation on how to configure the new Tika parser in your environment? I see that the old parsers have been removed by default, but there's nothing that shows me how to include/exclude document types.

I believe your assessment of 'ready' is not inclusive of some very important things, and that you would be doing a service to newcomers by bringing documentation in line with the current offerings. This is not trivial code, and it takes a long time for someone from the outside to understand it. That process is being stifled on multiple fronts, as far as I can see. Either that, or there is an important document that exists and I haven't read.

Phil Barnett
Senior Programmer / Analyst
Walt Disney World, Inc.
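P.S. In case it helps anyone answer question 4: my guess, from reading conf/nutch-default.xml, is that which parsers run is controlled by the plugin.includes property (to enable parse-tika at all), and possibly by conf/parse-plugins.xml for mapping content types to parsers. Something like this is what I've been trying in conf/nutch-site.xml -- again, a guess adapted from the shipped defaults, not something I found documented:

```xml
<!-- conf/nutch-site.xml -- my guess at enabling the Tika parser.
     The value below is adapted from the plugin.includes default in
     nutch-default.xml; I don't know whether editing this regex is the
     intended way to include or exclude document types. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|tika)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```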