> The nutch "dev team" isn't focused on PDF parsing. Nutch is a search
> engine framework,

IMHO, if you don't parse something correctly, you cannot rely on the
results.
We have all parsed things where leaving out a comma makes the parse
results wrong.  If there were a bug in Nutch's HTML parsing, would
that be a big deal? How about if it parsed the text in a particular tag
out of order?  PDF is unfortunately not HTML, where you can parse the
file sequentially and get an accurate result, but it is the second most
ubiquitous format.  PDFBox is not a PDF parsing framework either.  It
has some PDF parsing algorithms that aren't being used.  Google does a
good job parsing PDF; Nutch has to as well if it's going to compete.




-----Original Message-----
From: Chris Mattmann [mailto:[EMAIL PROTECTED] 
Sent: Saturday, March 04, 2006 4:10 PM
To: [email protected]
Subject: Re: project vitality?


Hello,

 I've been following this conversation for the past week and decided
that I'd go ahead and chime in now. I think that honestly this whole
thread of discussion needs to be taken off list, because it doesn't
really have anything to do with the "use" of Nutch: what it boils down
to is a list of complaints, requests for improvements and what not.
Nutch's goal is to be a large-scale, open source search engine: it's not
a PDF parsing framework, nor is it as thoroughly documented as some
commercial software -- although I've run into many commercial software
products that don't have the same quality of documentation that Nutch
even has now in its nascent stages.

> Now that I have said that, I want to express my feeling that it's hard
> when it takes a week to figure out that invertlinks only applies to
> version 0.8, and when you ask to become a volunteer, you are met with
> no response.

You don't need to "ask" to become a volunteer: just do it. As Doug said,
create a patch, submit the patch to JIRA and let the community look at
it. Change something on the Wiki if you don't think that the
documentation there is particularly good. Use Nutch to do whatever you
like, and if you feel that you contributed something that is applicable
to a broader community outside of your domain, let people know about it.
If it's really cool, I wouldn't worry about people ignoring you: they'll
come around.

> It's also frustrating when you share some hard earned insights into
> something that nutch needs to work on, like pdf parsing, and your 
> comments don't get a single good response from the nutch dev team.

The nutch "dev team" isn't focused on PDF parsing. Nutch is a search
engine framework, and to Nutch, a PDF parser is a "black box" that
conforms to a standard parsing interface that can be swapped out as
technology evolves. Right now, Nutch uses PDFBox, but in a week it could
use "hot super new rad PDF parsing technology X.1", or some other
greater PDF parser. If you feel that PDFBox isn't getting the job done
for your particular domain, then post an actual question, not pointers
to documents for the Nutch developers to go read. Honestly, I'm guessing
they don't have the time, nor the desire to go read a whole bunch of PDF
documentation unless there's a real use case, and a real need to upgrade
the existing parser. Empirically show that Nutch's PDF capabilities
aren't getting the job done, post your results to the list, and let the
community look at them. I'd guess you'd generate more interest and probably
get a better response that way.

> 
> Sometimes, in OS projects I get the feeling that the developers 
> breathe different air than users, and that our help is not wanted or 
> that our questions are stupid and not worth their time to answer.

As far as I can tell the Nutch developers all breathe the same air as us
(and moreover, I believe they put on their pants "one leg at a time").

> 
> Nutch is nowhere near being a dead project, that is not what I said (I
> said it was close, not closed), it's just that I don't feel that it's
> something that anyone can just download and use without running into
> problems.

Problems is a generic word: I would agree with your statement if you
qualified what "problems" means. Small problems like configuration
issues? I'd buy that. Exception messages not providing super super
detailed information about the error? Sure, I'd even buy that in some
cases. However, larger, bigger problems that generally fall in the class
of "bugs"? I would say the answer to that is probably a "no".

> Problems always exist, but need to be documented correctly so that
> they can be solved quickly.  I think nutch has a long way to go before
> it is comparable to tomcat or httpd, which are both production ready
> and have literally volumes of information on using them in every
> manner possible.

Check out the committers list on Tomcat (
http://tomcat.apache.org/whoweare.html) versus that of Nutch (
http://lucene.apache.org/nutch/credits.html). 21 active committers on the
Tomcat PMC and many more emeritus committers. Nutch has fewer than 10.
Given the wealth of capability and functionality that Nutch provides,
the ability to deploy it in production quality environments (of which,
I can assure you, after having been on the mailing lists for the better
part of a year, there are plenty), and its ease of use, I would have to
respectfully disagree with the majority of your assertions and say that
the Nutch folks are doing a great job.

Now, can we please take this discussion off the public mailing lists? I
would think that the majority of folks on the list would like to move
on. I know that I would.

Cheers,
  Chris


> 
> I am sorry if you don't like my opinion or the way it is expressed.
> 
> -----Original Message-----
> From: carmmello [mailto:[EMAIL PROTECTED]
> Sent: Saturday, March 04, 2006 10:54 AM
> To: [email protected]
> Subject: RE: project vitality?
> 
> 
> I really cannot agree with the way Mr. Richard Braman expresses his
> views.  I have tried Nutch since version 0.3 and I could not make the
> 0.8 release work (Nutch is becoming a little bit complicated with all
> that map reduce, hadoop, and so on, that I can't deal with).  I
> understand, however, that if a product is not finished yet, it may
> sometimes fail for lack of some fundamental documentation,
> but if there is a bunch of people who develop, for free, a product
> that is commercially worth some thousands of dollars and may fit our
> purposes, we have to say thanks.  After that we can, of course,
> express our views, complaints and suggestions, but we should refrain
> from harsh, irrelevant comments that go nowhere, like this
> non-technical post of mine. I myself have my own experimental
> implementation of Nutch 0.7.1.x (a nightly version), with more than
> 400,000 pages, that can sometimes be viewed during Brazilian working
> hours at http://www.qualidade.eng.br/constelacao.htm .  It is in
> Portuguese, but English terms related to quality, standards and
> environment can be searched.
> 
