Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Phil Barnett Fri, 30 Apr 2010 22:44:27 -0700

On Wed, Apr 28, 2010 at 11:01 AM, Mattmann, Chris A (388J) <
chris.a.mattm...@jpl.nasa.gov> wrote:


>
> Unfortunately some parts of the documentation on Nutch (namely the
> tutorial,
> and other parts of the static site) have been out of date for a while. This
> has occurred really independent of the releases, and independent of the
> wiki
> [1], which hasn't really fallen out of date as quick.
>

While documentation may not be part of the code, it's certainly part of the
project. And it's just as important as the code. Yes, I know that
documentation is the bane of programmers everywhere. I'm a coder. I get it.
But when you change the way things work in a fundamental way that leaves all
of your documentation behind, it's time to spend some time on it.


> >
> > For example, my find of broken code in bin/nutch crawl, a most basic way
> of
> > getting it running.
>
> Can you elaborate on your find of broken code? Did you file a JIRA issue
> for
> this in the Nutch JIRA system [2] ?
>

Yes, it led to another release. The bug fix I contributed was incorporated.

> > And I have yet to get the deepcrawl script which seems to be the
> suggestion
> > of how to get beyond bin/nutch crawl. It doesn't return any data at all
> and
> > has an error in the middle of it's run regarding missing file which the
> last
> > stage apparently failed to write. (I believe because the scheduler
> excluded
> > everything)
>
> The more information you provide here about your environment and your
> situation that caused the error, as well as e.g., detailed information (a
> stack trace, an exception, something), the easier it is to track down what
> you're seeing.
>

Yes, that was all in the unanswered emails. It would be easier for you to
search your inbox than for me to send it all over again.


> > I wonder if the developers have advanced so far past these basic scripts
> as
> > to have pretty much left them behind. This leads to these basics that
> people
> > start with not working.
>
> I wouldn't say developers have advanced beyond anything really for that
> matter :) The number of active developers in Nutch these days is fairly
> small, but interest and the user community is stable and there are some
> pretty large scale deployments of Nutch to my knowledge. That said, those
> folks have been following the mailing lists for a while, have been using
> the
> software for a while and thus their level of entry into the documentation
> may be at a little higher bar than that of a newer user such as yourself.
>

bin/nutch crawl was plainly broken and it would never have worked for anyone
who tried it. 'nuff said.


> That said, one thing to realize is that this is open source software, so in
> the end, as they say in Apache, "those that do, decide", or "patches
> welcome!" In other words, if there are things that you see that could be
> fixed, improved, made more configurable, etc., including the code, but
> *also
> the documentation*, then by all means we'd appreciate your feedback and
> contribution. Nutch is not simply a product of the developers that
> contribute their (potentially and often unsalaried) time to work on it, but
> of its user community as well.
>

I've been the leader of a major open source project for over 10 years. Last
fall I relinquished the reins of that project to a new project leader, so I
think I know how it works. We wrote an open source, cross-platform compiler
for xBase (Clipper) code, named the Harbour Project, now in release 2.0.

That would be why I not only raised the flag that it's not ready to release,
but I tracked down a bug and submitted a bug fix.

And I'm still saying it's not ready to release. There's still another bug
that I have found that goes unanswered.


> > I've spend dozens of hours trying to get 1.1 to work anything like 1.0
> and
> > I'm getting nowhere at all. It's pretty frustrating to spend that much
> time
> > trying to figure out how it works and keep hitting walls. And then asking
> > basic questions here that go unanswered.
>
> I apologize that your questions have gone unanswered and that you're
> hitting
> walls with regards to using Nutch. What questions did you ask? Perhaps it's
> the detail that you are providing (or not providing), or perhaps it's the
> way you're asking the questions. Or (even more likely) it's the fact that
> this is an open source project and thus the committers get around to user
> emails lists as one of the multiple items on their plate that they are
> working on the project and us committers may have missed your question, or
> perhaps those that had the time weren't particular experts in the one area
> of Nutch that you were asking about. There could be a number of reasons.
> Regardless, persistence is key as is *patience* and respectfulness. This
> has
> always to my knowledge been a really friendly community, so if you hang
> around and keep asking questions they will get answered I'm confident of
> that.
>

Great. Now that it's out in the open, perhaps someone who does know about
the things I asked about can reply to my questions.


> > The view from the outside is not so good from my direction. If you don't
> > keep documentation up to date and you change the way things work, the
> > project as seen from the outside, is plainly broken.
>
> In certain cases you are right, but I would take your above comments as
> verbatim across the board. For example, if you believe there is
> documentation lacking, then the first step is typically to file JIRA issues
> to alert committers and other users of Nutch of your concern and then have
> discussion on the lists regarding the issues. At some point a patch is
> produced, and then attached to the issue, where the committers can review
> the patches and then work to get them committed to the code base.
>
> Nutch has a number of unit tests for regression that ship with the product
> that tell me that it's not broken, and users that are able to make it work
> in their environments. There have been some recent bug fixes in the 1.1 RC
> that we caught which have been fixed (NUTCH-812, NUTCH-814, etc.), but
> that's natural.
>

No, not we. Me. I found a bug, told you about it and provided the fix.
Before I did that, I told you that your release candidate was broken. Just
like I'm still saying, unless I'm doing something grossly wrong, it's still
broken.


> > I'd be happy to give you feedback on where I find these problems and I'll
> > even donate whatever fixes I can come up with, but Java is not a language
> > I'm familiar with and going is slow weeding through things. I really need
> > this project to work for me. I want to help.
>
> There are other ways to contribute to the project besides coding - I just
> thought of a really good reference that you can read in this regard put
> together by Dennis Kubes, one of the Nutch committers and PMC members.
> Check
> this out [3]. You may also want to check out our FAQ [4].
>

Yes, I've read the faq. I've searched for answers in the documentation for
weeks. Once I did that and came to a dead end, I asked questions here.

> > 1. Where is the scheduler documented? If I want to crawl everything from
> > scratch, where is the information from the last run stored? It seems like
> > the schedule is telling my crawl to ignore pages due to scheduler knocking
> > them out. It's not obvious to me why this is happening and how to stop it
> > from happening. I think right now this is my major roadblock in getting
> > bin/nutch crawl working. Maybe the scheduler code no longer works properly
> > in bin/nutch crawl. I can't tell if it's that or if the default
> > configurations don't work.
>
> I think you might be talking about the Fetcher: there is documentation of it
> here:
>
> http://bit.ly/alqFoA
> http://wiki.apache.org/nutch/FetchOptions
> http://wiki.apache.org/nutch/CommandLineOptions
>
>
I'm talking about the part of the fetcher that keeps it from fetching the
same document within a specific time frame.
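
A toy sketch of the behaviour described here, under stated assumptions: this
is not Nutch's actual code, and the function name, the 30-day interval and
the dates are hypothetical values chosen only to illustrate the idea of a
re-fetch interval.

```python
from datetime import datetime, timedelta


def is_due_for_fetch(last_fetched, interval, now):
    """Return True if a page should be fetched again.

    A page that has never been fetched (last_fetched is None) is always due.
    Otherwise it is due only once `interval` has elapsed since the last fetch.
    """
    if last_fetched is None:
        return True
    return now - last_fetched >= interval


# Hypothetical values, for illustration only.
now = datetime(2010, 4, 30, 12, 0)
interval = timedelta(days=30)

print(is_due_for_fetch(None, interval, now))                      # never fetched
print(is_due_for_fetch(now - timedelta(days=2), interval, now))   # fetched recently
print(is_due_for_fetch(now - timedelta(days=45), interval, now))  # interval elapsed
```

Under a rule like this, a crawl re-run shortly after an earlier one would
skip every previously fetched page, which would be consistent with a crawl
that returns no data.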


> > 2. Where are the control files in conf documented? How do I know which
> ones
> > do what and when? There's a half dozen *-urlfilters. Why?
>
> Some of these are admittedly newer features but others are not:
>
> http://wiki.apache.org/nutch/RegexURLFiltersBenchs
> http://bit.ly/b99NLK
>
> >
> > 3. Why doesn't your post nightly compile tests include bin/nutch crawl or
> if
> > it does, why didn't it find the error that stopped it from running?
>
> Good question. I'm not super familiar with the nightly tests, but my guess
> is that the scripts are outside the context of the tests since most of the
> tests use Junit and are testing the Java API and classes. I may be wrong
> though.
>

Then that means that you need more unit and process tests that are run
before a release candidate. If the nightly build tests are this weak, you
can't depend on them to tell you all you need to know. It would keep you
from creating a release candidate that was plainly broken in a most
fundamental way.
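
A minimal sketch of what such a process test could look like, assuming a
script can simply be run and judged by its exit status. The helper name
`smoke_test` and the sample command line in the comment are illustrative
assumptions, not something that ships with Nutch.

```python
import subprocess


def smoke_test(command, timeout_seconds=600):
    """Run a command line and report whether it exited cleanly.

    Returns (ok, output). ok is True only when the exit status is 0;
    output is the combined stdout/stderr text, or the error message if
    the command could not be started or timed out.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout_seconds,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Script missing, not executable, or still running at the timeout.
        return False, str(exc)
    return result.returncode == 0, result.stdout


# Hypothetical use from a build step (directory names are placeholders):
#   ok, output = smoke_test(["bin/nutch", "crawl", "urls", "-dir", "crawl-test", "-depth", "1"])
#   if not ok:
#       raise SystemExit(output)
```

A check like this only proves the script starts and finishes without an
error status. It says nothing about whether the crawl produced useful data,
so it would complement the JUnit tests rather than replace them.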

> > 4. Where is the documentation on how to configure the new tika parser in
> > your environment? I see that the old parsers have been removed by default,
> > but there's nothing that shows me how to include/exclude document types.
>
> Julien Nioche put this together on the TikaPlugin:
>
> http://wiki.apache.org/nutch/TikaPlugin
>

Great, thanks. I'll try to get back into my studies of how 1.1 works as I
can. Work is very busy and full of demands. For now, I've been dodging
questions about Nutch so I can try to understand it better. But the few very
pointed questions that I asked here last week were not answered, so I started
working on another project. I believe I'm a pretty good communicator and I
think I asked complete questions that were not vague.


> > I believe your assessment of 'ready' is not inclusive of some very
> important
> > things and that you would be doing a service to newcomers to bring
> > documentation in line with current offerings. This is not trivial code
> and
> > it takes a long time for someone from the outside to understand it. That
> > process is being stifled on multiple fronts as far as I can see. Either
> that
> > or I have missed an important document that exists and I haven't read it.
>
> Ready in the sense of the release is a consensus decision made by the
> developers and community based on a variety of things:
>
> * issues being resolved in JIRA of a particular priority
> * time in-between last release
> * community requesting a release
> * according to some pre-defined schedule
> * making a feature release to get out new interesting features
> etc etc.
>

Most of the above are Marketing issues, not release issues, but I'm not on
the staff here, so I won't critique. You have your priorities, that's good
enough for me.

One of the pleasures of Open Source is that there is no marketing department
forcing you to release a product that is not yet ready. We've all lived with
products like that. In the short run it's not fun. And in the long run it
will give you a bad reputation.


> I'm sorry that you are experiencing problems, and our goal is to try and
> address as many as possible and prioritize them, but in the end, Apache has
> a process regarding releases [5], which is based somewhat on input from the
> community (usually in the form of simple majority), but ultimately based on
> a Project Management Committee [6] structure, whose votes are binding on a
> particular release.
>
> I hope that we can work with you to continue to use Nutch and make it
> useful
> in your environment, but in the meanwhile, I would suggest you keep
> plugging
> along, continue to push forward and check out some of the references I
> included in this email moving forward.
>

My first, second and third attempts at getting 1.1 working were to duplicate
what I did with 1.0 to get it working. I even went so far as to dump the
entire directory and start over multiple times, in hopes that I had just done
something wrong.

I have found at least two bugs. One of them I tracked down, fixed, and
submitted code for. The other I don't even know where to start hunting, and
that is what led me to post some questions here.

I'd appreciate it if someone knowledgeable would look at those questions
from last week and give me some feedback.
