>The nutch "dev team" isn't focused on PDF parsing. Nutch is a search engine framework,
IMHO, if you don't parse something correctly, you cannot rely on the results. We have all parsed things where you leave a comma out and the parse results are wrong. If there was a bug in Nutch's HTML parsing, would that be a big deal? How about if it parsed the text in a particular tag out of order? PDF is unfortunately not HTML, where you can parse the file sequentially and get an accurate result, but its use is the second most ubiquitous. PDFBox is not a PDF parsing framework either. It has some PDF parsing algorithms that aren't being used. Google does a good job parsing PDF; Nutch has to as well if it's going to compete.

-----Original Message-----
From: Chris Mattmann [mailto:[EMAIL PROTECTED]
Sent: Saturday, March 04, 2006 4:10 PM
To: [email protected]
Subject: Re: project vitality?

Hello,

I've been following this conversation for the past week and decided that I'd go ahead and chime in now. I think that honestly this whole thread of discussion needs to be taken off list, because it doesn't really have anything to do with the "use" of Nutch: what it boils down to is a list of complaints, requests for improvements and what not. Nutch's goal is to be a large-scale, open source search engine: it's not a PDF parsing framework, nor is it as thoroughly documented as some commercial software -- although I've run into many commercial software products that don't have the same quality of documentation that Nutch has even now in its nascent stages.

> Now that I have said that, I want to express my feeling that it's hard
> when it takes a week to figure out that invertlinks only applies to
> version 0.8. and when you ask to become a volunteer, you are met with
> no response.

You don't need to "ask" to become a volunteer: just do it. As Doug said, create a patch, submit the patch to JIRA and let the community look at it. Change something on the Wiki if you don't think that the documentation there is particularly good.
Use Nutch to do whatever you like, and if you feel that you contributed something that is applicable to a broader community outside of your domain, let people know about it. If it's really cool, I wouldn't worry about people ignoring you: they'll come around.

> It's also frustrating when you share some hard earned insights into
> something that nutch needs to work on, like pdf parsing, and your
> comments don't get a single good response from the nutch dev team.

The nutch "dev team" isn't focused on PDF parsing. Nutch is a search engine framework, and to Nutch, a PDF parser is a "black box" that conforms to a standard parsing interface that can be swapped out as technology evolves. Right now, Nutch uses PDFBox, but in a week it could use "hot super new rad PDF parsing technology X.1", or some other greater PDF parser. If you feel that PDFBox isn't getting the job done for your particular domain, then post an actual question, not pointers to documents for the Nutch developers to go read. Honestly, I'm guessing they have neither the time nor the desire to go read a whole bunch of PDF documentation unless there's a real use case, and a real need to upgrade the existing parser. Empirically show that Nutch's PDF capabilities aren't getting the job done, post your results to the list, and let the community look at them. I'd guess you'd generate more interest and probably get a better response that way.

> Sometimes, in OS projects I get the feeling that the developers
> breathe different air than users, and that our help is not wanted or
> that our questions are stupid and not worth their time to answer.

As far as I can tell the Nutch developers all breathe the same air as us (and moreover, I believe they put on their pants "one leg at a time").

> Nutch is nowhere near being a dead project, that is not what I said (I
> said it was close, not closed), it's just that I don't feel that it's
> something that anyone can just download and use without running into
> problems.
"Problems" is a generic word: I would agree with your statement if you qualified what "problems" means. Small problems like configuration issues? I'd buy that. Exception messages not providing super super detailed information about the error? Sure, I'd even buy that in some cases. However, larger problems that generally fall in the class of "bugs"? I would say the answer to that is probably a "no".

> Problems always exist, but need to be documented correctly so that
> they can be solved quickly. I think nutch has a long way to go before
> it is comparable to tomcat or httpd, which are both production ready
> and have literally volumes of information on using in every manner
> possible.

Check out the committers list on Tomcat (http://tomcat.apache.org/whoweare.html) versus that of Nutch (http://lucene.apache.org/nutch/credits.html). There are 21 active committers on the Tomcat PMC and many more emeritus committers. Nutch has fewer than 10. Given the wealth of capability and functionality that Nutch provides, the ability to deploy it in production quality environments (of which, I can assure you after having been on the mailing lists for the better part of a year, there are plenty), and its ease of use, I would have to respectfully disagree with the majority of your assertions and say that the Nutch folks are doing a great job.

Now, can we please take this discussion off the public mailing lists? I would think that the majority of folks on the list would like to move on. I know that I would.

Cheers,
Chris

> I am sorry if you don't like my opinion or the way it is expressed.
>
> -----Original Message-----
> From: carmmello [mailto:[EMAIL PROTECTED]
> Sent: Saturday, March 04, 2006 10:54 AM
> To: [email protected]
> Subject: RE: project vitality?
>
> I really can not agree with the way Mr. Richard Braman expresses his
> views.
> I have tried Nutch since version 0.3 and I could not make the
> 0.8 release work (Nutch is becoming a little bit complicated with all
> those map reduce, hadoop, and so on, that I can't deal with). I
> understand, however, that if a product is not finished yet, it may
> sometimes fail for lack of some fundamental documentation,
> but, if there is a bunch of people who develop, for free, a product
> that is commercially worth some thousands of dollars and may fit our
> purposes, we have to say thanks. After that we can, of course,
> express our views, complaints and suggestions, but we should refrain
> from hard, irrelevant comments that go nowhere, like this
> non-technical post of mine. I, myself, have my own experimental
> implementation of Nutch 0.7.1.x (a nightly version), with more than
> 400,000 pages, that can sometimes be viewed during Brazilian working
> hours, at http://www.qualidade.eng.br/constelacao.htm . It is in
> Portuguese, but English terms related to quality, standards and
> environment can be searched.
