Hi Richard,

> IMHO, if you don't parse something correctly, you cannot rely on the
> results.

Good, we're on the same page here.

> We have all parsed things where you leave a comma out and the parse
> results are wrong. If there was a bug in nutch's html parsing would
> that be a big deal?

Yes, it would be. HTML is the foundation for the web. Its content is the
most pervasive out there (as you allude to below).

> How about if it parsed the text in a particular tag
> out of order?

I'm wondering what that has to do with anything? You may want to read up
on Lucene (http://lucene.apache.org/). Lucene is the underlying text
search API (and index format) that Nutch is built on top of, and I'm
wondering if it cares about the order in which a piece of text is given
to it?

> Pdf is unfortunately not html where you can parse the
> file sequentially and get an accurate result,

Gonna have to disagree with you on this. You're making a general
statement that's not true across the board. I would assert that in many
cases, you can still get an accurate result. What about a PDF research
paper? Do you care about what order the text comes in if you're just
doing general "Google like" search? When I go to Google and type "grid
computing papers", do I care that "grid computing" comes before some
text within the research paper? Possibly, but mainly I care that "grid
computing" was an emphasized phrase within the text. Now, your
definition of "emphasized" may not just be that it's the first text that
appears in the paper (in the title, say): you may just care that the
frequency of "grid computing" in the paper is relatively higher than a
certain threshold compared to other terms. On the other hand, the fact
that "grid computing" is in the title and comes first in the PDF may
mean a lot to you. That's the nature of trying to extract structure out
of inherently unstructured content.
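
To make the frequency point concrete, here is a minimal sketch in Python.
It is not how Lucene or Nutch actually scores documents; the function
names and the sample text are made up for illustration. It counts terms
over extracted text blocks and shows that the counts come out the same
whichever order the blocks arrive in. Anything position-based, like
matching a phrase that spans two blocks or treating the title specially,
would of course still depend on order.

```python
from collections import Counter
import re


def term_counts(blocks):
    """Count lowercase word occurrences across text blocks (hypothetical helper)."""
    counts = Counter()
    for block in blocks:
        counts.update(re.findall(r"[a-z0-9]+", block.lower()))
    return counts


def relative_frequency(counts, term):
    """Share of all counted words that are `term` (0.0 for an empty document)."""
    total = sum(counts.values())
    return counts[term] / total if total else 0.0


# Made-up text blocks, as a PDF extractor might emit them.
blocks_in_order = [
    "Grid Computing Survey",
    "Grid systems schedule work across many machines.",
    "We review grid middleware.",
]
# The same blocks, delivered in a different order.
blocks_out_of_order = list(reversed(blocks_in_order))

in_order = term_counts(blocks_in_order)
shuffled = term_counts(blocks_out_of_order)

# A pure term-frequency measure cannot tell the two extractions apart.
print(in_order == shuffled)
print(relative_frequency(in_order, "grid"))
```
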

I'm not saying that the structure or order of text within a document is
never useful: I agree that in a lot of cases, it can help you to infer
what values are associated with what fields you want to index, etc. All
I'm saying is that it's certainly a subset of the greater functionality
of just doing free text search, so you shouldn't generalize and say that
you can't parse a PDF sequentially and obtain good results.

> but its use is second most
> ubiquitous. PDFBox is not a PDF parsing framework either. It has some
> pdf parsing algorithms, that aren't being used. Google does a good job
> parsing pdf, nutch has to do so if it's going to compete.

Can you show that Google's PDF parsing capability is any better than
Nutch's using accepted evaluation methods for PDF? How about some real
use cases and real results? Until we see such numbers, I'm hesitant to
believe what you're saying is true. If it is though, then I'm sure that
the community would welcome any updates to the PDF parsing plugin that
expedite its improvement.

Cheers,
Chris

>
> -----Original Message-----
> From: Chris Mattmann [mailto:[EMAIL PROTECTED]
> Sent: Saturday, March 04, 2006 4:10 PM
> To: [email protected]
> Subject: Re: project vitality?
>
>
> Hello,
>
> I've been following this conversation for the past week and decided
> that I'd go ahead and chime in now. I think that honestly this whole
> thread of discussion needs to be taken off list, because it doesn't
> really have anything to do with the "use" of Nutch: what it boils down
> to is a list of complaints, requests for improvements and what not.
> Nutch's goal is to be a large-scale, open source search engine: it's not
> a PDF parsing framework, nor is it as thoroughly documented as some
> commercial software -- although I've run into many commercial software
> products that don't have the same quality of documentation that Nutch
> even has now in its nascent stages.
>
>> Now that I have said that, I want to express my feeling that it's hard
>> when it takes a week to figure out that invertlinks only applies to
>> version 0.8, and when you ask to become a volunteer, you are met with
>> no response.
>
> You don't need to "ask" to become a volunteer: just do it. As Doug said,
> create a patch, submit the patch to JIRA and let the community look at
> it. Change something on the Wiki if you don't think that the
> documentation is particularly good there. Use Nutch to do whatever you
> like, and if you feel that you contributed something that is applicable
> to a broader community outside of your domain, let people know about it.
> If it's really cool, I wouldn't worry about people ignoring you: they'll
> come around.
>
>> It's also frustrating when you share some hard earned insights into
>> something that nutch needs to work on, like pdf parsing, and your
>> comments don't get a single good response from the nutch dev team.
>
> The nutch "dev team" isn't focused on PDF parsing. Nutch is a search
> engine framework, and to Nutch, a PDF parser is a "black box" that
> conforms to a standard parsing interface that can be swapped out as
> technology evolves. Right now, Nutch uses PDFBox, but in a week it could
> use "hot super new rad PDF parsing technology X.1", or some other
> greater PDF parser. If you feel that PDFBox isn't getting the job done
> for your particular domain, then post an actual question, not pointers
> to documents for the Nutch developers to go read. Honestly, I'm guessing
> they don't have the time, nor the desire to go read a whole bunch of PDF
> documentation unless there's a real use case, and a real need to upgrade
> the existing parser. Empirically show that Nutch's PDF capabilities
> aren't getting the job done, post your results to the list, and let the
> community look at them. I'd guess you'd generate more interest and
> probably get a better response that way.
>
>>
>> Sometimes, in OS projects I get the feeling that the developers
>> breathe different air than users, and that our help is not wanted or
>> that our questions are stupid and not worth their time to answer.
>
> As far as I can tell the Nutch developers all breathe the same air as us
> (and moreover, I believe they put on their pants "one leg at a time").
>
>>
>> Nutch is nowhere near being a dead project, that is not what I said (I
>> said it was close, not closed), it's just that I don't feel that it's
>> something that anyone can just download and use without running into
>> problems.
>
> Problems is a generic word: I would agree with your statement if you
> qualified what "problems" means. Small problems like configuration
> issues? I'd buy that. Exception messages not providing super super
> detailed information about the error? Sure, I'd even buy that in some
> cases. However, larger problems that generally fall in the class
> of "bugs"? I would say the answer to that is probably a "no".
>
>> Problems always exist, but need to be documented correctly so that
>> they can be solved quickly. I think nutch has a long way to go before
>> it is comparable to tomcat or httpd, which are both production ready
>> and have literally volumes of information on using in every manner
>> possible.
>
> Check out the committers list on Tomcat
> (http://tomcat.apache.org/whoweare.html) versus that of Nutch
> (http://lucene.apache.org/nutch/credits.html). 21 active committers on
> the Tomcat PMC and many more emeritus committers. Nutch has fewer than
> 10. To have the wealth of capability and functionality that Nutch
> provides, with the ability to deploy it in production quality
> environments (which I can assure you, after having been on the mailing
> lists for the better part of a year, there are plenty), and its ease of
> use, I would have to respectfully disagree with the majority of your
> assertions and say that the Nutch folks are doing a great job.
>
> Now, can we please take this discussion off the public mailing lists? I
> would think that the majority of folks on the list would like to move
> on. I know that I would.
>
> Cheers,
> Chris
>
>
>>
>> I am sorry if you don't like my opinion or the way it is expressed.
>>
>> -----Original Message-----
>> From: carmmello [mailto:[EMAIL PROTECTED]
>> Sent: Saturday, March 04, 2006 10:54 AM
>> To: [email protected]
>> Subject: RE: project vitality?
>>
>>
>> I really can not agree with the way Mr. Richard Braman expresses his
>> views. I have tried Nutch since version 0.3 and I could not make the
>> 0.8 release work (Nutch is becoming a little bit complicated with all
>> those map reduce, hadoop, and so on, that I can't deal with). I
>> understand, however, that if a product is not finished yet, sometimes
>> it may fail with the lack of some fundamental documentation,
>> but, if there is a bunch of people who develop, for free, a product
>> that is commercially worth some thousands of dollars and may fit our
>> purposes, we have to say thanks. After that we can, of course,
>> express our views, complaints and suggestions, but we should refrain
>> from some hard, non relevant comments, that go nowhere, like this,
>> non technical, post of mine. I, myself, have my own experimental
>> implementation of Nutch 0.7.1.x (a nightly version), with more than
>> 400,000 pages, that can be, sometimes, viewed during Brazilian working
>> hours, at http://www.qualidade.eng.br/constelacao.htm . It is in
>> Portuguese, but English terms related to quality, standards and
>> environment can be searched.
>>
