On Mon, Apr 26, 2010 at 1:55 AM, Mattmann, Chris A (388J) < chris.a.mattm...@jpl.nasa.gov> wrote:
> Please vote on releasing these packages as Apache Nutch 1.1. The vote is
> open for the next 72 hours.

How do you test to see whether Nutch works the way the documentation says it works? As a newcomer to the project, I still find major differences between what the existing documentation tells me and what it actually takes to get Nutch running. For example, there is my find of broken code in bin/nutch crawl, the most basic way of getting it running. And I have yet to get the deepcrawl script working, which seems to be the suggested way to get beyond bin/nutch crawl. It doesn't return any data at all, and it hits an error in the middle of its run about a missing file that the previous stage apparently failed to write (I believe because the scheduler excluded everything).

I wonder if the developers have advanced so far past these basic scripts as to have pretty much left them behind. This leads to the basics that people start with not working. I've spent dozens of hours trying to get 1.1 to work anything like 1.0, and I'm getting nowhere at all. It's pretty frustrating to spend that much time trying to figure out how it works and keep hitting walls, and then to ask basic questions here that go unanswered. The view from the outside is not so good from my direction. If you don't keep documentation up to date and you change the way things work, the project, as seen from the outside, is plainly broken.

I'd be happy to give you feedback on where I find these problems, and I'll even donate whatever fixes I can come up with, but Java is not a language I'm familiar with, so going is slow weeding through things. I really need this project to work for me. I want to help.

1. Where is the scheduler documented? If I want to crawl everything from scratch, where is the information from the last run stored? It seems like the schedule is telling my crawl to ignore pages, because the scheduler is knocking them out. It's not obvious to me why this is happening or how to stop it from happening.
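For the record, here is what I've been experimenting with on that front. My understanding (which may well be wrong, since I can't find it documented) is that the refetch schedule is driven by the db.fetch.interval.default property I found in conf/nutch-default.xml, so I've been trying to override it in conf/nutch-site.xml. Treat this as a guess, not documentation:

```xml
<!-- conf/nutch-site.xml -- my attempt at forcing pages to be re-fetched.
     I am guessing at the relevance of this property from reading
     conf/nutch-default.xml; please correct me if the scheduler is
     actually controlled somewhere else. -->
<property>
  <name>db.fetch.interval.default</name>
  <!-- the shipped default is 2592000 seconds (30 days); a small value
       should, if I understand it, make every page eligible for
       fetching again on the next run -->
  <value>60</value>
</property>
```

Alternatively, deleting the whole crawl output directory before re-running bin/nutch crawl should discard the old crawldb entirely, if I understand the on-disk layout correctly.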
I think right now this is my major roadblock in getting bin/nutch crawl working. Maybe the scheduler code no longer works properly in bin/nutch crawl; I can't tell whether it's that or whether the default configurations don't work.

2. Where are the control files in conf documented? How do I know which ones do what, and when? There's a half dozen *-urlfilters. Why?

3. Why don't your post-nightly-compile tests include bin/nutch crawl? Or, if they do, why didn't they find the error that stopped it from running?

4. Where is the documentation on how to configure the new Tika parser in your environment? I see that the old parsers have been removed by default, but there's nothing that shows me how to include/exclude document types.

I believe your assessment of 'ready' is not inclusive of some very important things, and that you would be doing a service to newcomers by bringing documentation in line with the current offerings. This is not trivial code, and it takes a long time for someone from the outside to understand it. That process is being stifled on multiple fronts, as far as I can see. Either that, or there is an important document that exists and I haven't read.

Phil Barnett
Senior Programmer / Analyst
Walt Disney World, Inc.
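P.S. In case it helps anyone answer question 4: my guess, from reading conf/nutch-default.xml, is that which parsers run is controlled by the plugin.includes property (to enable parse-tika at all), and possibly by conf/parse-plugins.xml for mapping content types to parsers. Something like this is what I've been trying in conf/nutch-site.xml -- again, a guess adapted from the shipped defaults, not something I found documented:

```xml
<!-- conf/nutch-site.xml -- my guess at enabling the Tika parser.
     The value below is adapted from the plugin.includes default in
     nutch-default.xml; I don't know whether editing this regex is the
     intended way to include or exclude document types. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|tika)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```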