Hi Phil,

Thanks very much for the feedback. I'd like to take a second to address your
points:

> 
> How do you test to see if Nutch works like the documentation says it works?
> I still find major differences between how existing documentation tells me,
> a newcomer to the project, how to get it running.

Unfortunately, some parts of the Nutch documentation (namely the tutorial
and other parts of the static site) have been out of date for a while. This
has happened independently of the releases, and independently of the wiki
[1], which hasn't fallen out of date as quickly.

> 
> For example, my find of broken code in bin/nutch crawl, a most basic way of
> getting it running.

Can you elaborate on your find of broken code? Did you file a JIRA issue for
this in the Nutch JIRA system [2]?

> 
> And I have yet to get the deepcrawl script which seems to be the suggestion
> of how to get beyond bin/nutch crawl. It doesn't return any data at all and
> has an error in the middle of it's run regarding missing file which the last
> stage apparently failed to write. (I believe because the scheduler excluded
> everything)

The more information you provide here about your environment and the
situation that caused the error, including details such as a stack trace or
an exception message, the easier it is to track down what you're seeing.

> 
> I wonder if the developers have advanced so far past these basic scripts as
> to have pretty much left them behind. This leads to these basics that people
> start with not working.

I wouldn't say the developers have advanced beyond anything, really :) The
number of active developers in Nutch these days is fairly small, but
interest and the user community are stable, and to my knowledge there are
some pretty large-scale deployments of Nutch. That said, those folks have
been following the mailing lists and using the software for a while, so
they may come to the documentation at a somewhat higher level than a newer
user such as yourself.

That said, one thing to realize is that this is open source software, so in
the end, as they say in Apache, "those that do, decide", or "patches
welcome!" In other words, if there are things that you see that could be
fixed, improved, made more configurable, etc., including the code, but *also
the documentation*, then by all means we'd appreciate your feedback and
contribution. Nutch is not simply a product of the developers that
contribute their (potentially and often unsalaried) time to work on it, but
of its user community as well.

> 
> I've spend dozens of hours trying to get 1.1 to work anything like 1.0 and
> I'm getting nowhere at all. It's pretty frustrating to spend that much time
> trying to figure out how it works and keep hitting walls. And then asking
> basic questions here that go unanswered.

I apologize that your questions have gone unanswered and that you're hitting
walls with Nutch. What questions did you ask? Perhaps it's the detail that
you are providing (or not providing), or perhaps it's the way you're asking
the questions. Or (even more likely) it's the fact that this is an open
source project: answering user mailing list email is only one of many items
on the committers' plates, so we may simply have missed your question, or
those who had the time weren't experts in the particular area of Nutch that
you were asking about. There could be a number of reasons. Regardless,
persistence is key, as are *patience* and respectfulness. To my knowledge
this has always been a really friendly community, so if you hang around and
keep asking questions, I'm confident they will get answered.

> 
> The view from the outside is not so good from my direction. If you don't
> keep documentation up to date and you change the way things work, the
> project as seen from the outside, is plainly broken.

In certain cases you are right, but I wouldn't take your comments above as
true across the board. For example, if you believe documentation is
lacking, the first step is typically to file JIRA issues to alert the
committers and other Nutch users to your concern, and then to discuss the
issues on the lists. At some point a patch is produced and attached to the
issue, where the committers can review it and work to get it committed to
the code base.

Nutch ships with a number of regression unit tests that tell me it's not
broken, and there are users who are able to make it work in their
environments. We caught and fixed some bugs recently in the 1.1 RC
(NUTCH-812, NUTCH-814, etc.), but that's natural.

> 
> I'd be happy to give you feedback on where I find these problems and I'll
> even donate whatever fixes I can come up with, but Java is not a language
> I'm familiar with and going is slow weeding through things. I really need
> this project to work for me. I want to help.

There are other ways to contribute to the project besides coding - I just
thought of a really good reference that you can read in this regard put
together by Dennis Kubes, one of the Nutch committers and PMC members. Check
this out [3]. You may also want to check out our FAQ [4].

> 
> 1. Where is the scheduler documented? If I want to crawl everything from
> scratch, where is the information from the last run stored? It seems like
> the schedule is telling my crawl to ignore pages due to scheduler knocking
> them out. It's not obvious to my why this is happening and how to stop it
> from happening. I think right now this is my major roadblock in getting
> bin/nutch crawl working. Maybe the scheduler code no longer works properly
> in bin/nutch crawl. I can't tell if it's that or if the default
> configurations don't work.

I think you might be talking about the Fetcher: there is documentation of it
here:

http://bit.ly/alqFoA
http://wiki.apache.org/nutch/FetchOptions
http://wiki.apache.org/nutch/CommandLineOptions

> 
> 2, Where are the control files in conf documented? How do I know which ones
> do what and when? There's a half dozen *-urlfilters. Why?

Some of these are admittedly newer features but others are not:

http://wiki.apache.org/nutch/RegexURLFiltersBenchs
http://bit.ly/b99NLK

> 
> 3. Why doesn't your post nightly compile tests include bin/nutch crawl or if
> it does, why didn't it find the error that stopped it from running?

Good question. I'm not super familiar with the nightly tests, but my guess
is that the scripts are outside the context of the tests since most of the
tests use JUnit and are testing the Java API and classes. I may be wrong
though.

> 
> 4. Where is the documentation on how to configure the new tika parser in
> your environment? I see that the old parsers have been removed by default,
> but there's nothing that shows me how to include/exclude document types.

Julien Nioche put this together on the TikaPlugin:

http://wiki.apache.org/nutch/TikaPlugin

> 
> I believe your assessment of 'ready' is not inclusive of some very important
> things and that you would be doing a service to newcomers to bring
> documentation in line with current offerings. This is not trivial code and
> it takes a long time for someone from the outside to understand it. That
> process is being stifled on multiple fronts as far as I can see. Either that
> or I have missed an important document that exists and I haven't read it.

Whether a release is "ready" is a consensus decision made by the developers
and the community, based on a variety of things:

* issues of a particular priority being resolved in JIRA
* the time elapsed since the last release
* the community requesting a release
* a pre-defined release schedule
* the desire to get interesting new features out in a feature release

and so on.

I'm sorry that you are experiencing problems, and our goal is to try and
address as many as possible and prioritize them, but in the end, Apache has
a process regarding releases [5], which is based somewhat on input from the
community (usually in the form of simple majority), but ultimately based on
a Project Management Committee [6] structure, whose votes are binding on a
particular release.

I hope that we can work with you so that you can continue to use Nutch and
make it useful in your environment. In the meantime, I would suggest you
keep plugging along and check out some of the references I included in this
email.

Thanks!

Cheers,
Chris


[1] http://wiki.apache.org/nutch/
[2] http://issues.apache.org/jira/browse/NUTCH
[3] http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer
[4] http://wiki.apache.org/nutch/FAQ
[5] http://www.apache.org/foundation/voting.html
[6] http://www.apache.org/dev/pmc.html

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

