RE: parsing pdf correctly

Richard Braman Sat, 04 Mar 2006 17:19:16 -0800

We also agree it's a general statement that most PDF text is not in
sequential order, and while definitely not true across the board, it is
true more often than it is untrue.


Maybe not at NASA (where the focus is on scientific research papers that
appeal to space types) but definitely in other government agencies where
publications are intended for a more general public audience.  And
definitely in newsletters and other content where the presentation is
paramount.

PDF is even more layout oriented than HTML, meaning it often cares less
about the underlying data and focuses solely on presentation.  If the
web was entirely in XML we all know it would be much easier to parse,
but it's not; the content is most often in HTML or PDF.  HTML is easy
to parse compared to PDF.  I have been parsing HTML for the last 10
years, but PDF has basically no underlying structure at all, and the
parsing methods are correspondingly harder.  Even heading tags such as
H1, which mark emphasized text, have no PDF equivalent.  The only thing
you can truly rely on is the PDF's metadata (and what if the author
omitted that?), or any tagged content, which PDFBox and most other PDF
parsers (Multivalent, JPedal) don't currently support either, mainly
because so little PDF content is tagged, although that may change at
federal agencies because of Section 508.
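
To illustrate the ordering problem, here is a minimal sketch in Python.
It is not how PDFBox or any other parser named here works.  The
Fragment type, the coordinates, and the single-column assumption are
all hypothetical; the only point is that text drawn at explicit page
positions can sit in the file in a different order than a person reads
it, and that recovering reading order means sorting on position.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A run of text plus the page position where it is drawn (hypothetical type)."""
    x: float  # distance from the left edge of the page
    y: float  # distance from the bottom edge (the y axis points up)
    text: str


def stream_order(fragments: list[Fragment]) -> str:
    """Join fragments in the order they happen to be stored in the file."""
    return " ".join(f.text for f in fragments)


def reading_order(fragments: list[Fragment], line_tolerance: float = 2.0) -> str:
    """Join fragments top-to-bottom, then left-to-right within each line.

    Fragments whose y values are within line_tolerance of the first
    fragment on a line are grouped into that line.  Assumes a
    single-column page; multi-column layouts need more work than this.
    """
    ordered = sorted(fragments, key=lambda f: -f.y)  # highest on the page first
    lines: list[list[Fragment]] = []
    for frag in ordered:
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return " ".join(
        f.text for line in lines for f in sorted(line, key=lambda f: f.x)
    )


# Made-up page: the title words are stored after the byline and reversed.
PAGE = [
    Fragment(72, 650, "by A. Author"),
    Fragment(200, 700, "Computing"),
    Fragment(72, 700, "Grid"),
    Fragment(72, 600, "Body text."),
]

print(stream_order(PAGE))   # storage order
print(reading_order(PAGE))  # position-sorted order
```

Stripping text in storage order gives "by A. Author Computing Grid Body
text.", while sorting on position gives "Grid Computing by A. Author
Body text."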

In many domain-specific searches PDF may be more ubiquitous than HTML,
especially in government, which puts almost everything in PDF nowadays.

I think it is pretty obvious that Google's PDF parsing technology is
better than Nutch's.  Google converts each PDF into an HTML page and
stores it as such.  I would venture to guess that Google runs the
resultant HTML page through its HTML parser in order to score the doc,
instead of just stripping text out.  Maybe I am wrong.

I will run the data once my crawl is complete and report on the results,
if data is what you need to be convinced.




-----Original Message-----
From: Chris Mattmann [mailto:[EMAIL PROTECTED] 
Sent: Saturday, March 04, 2006 7:14 PM
To: [email protected]
Subject: Re: project vitality?


Hi Richard,

> IMHO, if you don't parse something correctly, you cannot rely on the
> results.

Good, we're on the same page here.

> We have all parsed things where you leave a comma out and the parse
> results are wrong.  If there was a bug in Nutch's HTML parsing would
> that be a big deal?

Yes, it would be. HTML is the foundation for the web. Its content is the
most pervasive out there (as you allude to below).

> How about if it parsed the text in a particular tag
> out of order?

I'm wondering what that has to do with anything? You may want to read up
on Lucene (http://lucene.apache.org/). Lucene is the underlying text
search api (and index format) that Nutch is built on top of, and I'm
wondering if it cares about the order in which a piece of text is given
to it?

> Pdf is unfortunately not html where you can parse the
> file sequentially and get an accurate result,

Gonna have to disagree with you on this. You're making a general
statement that's not true across the board. I would assert that in many
cases, you can still get an accurate result. What about a PDF research
paper? Do you care about what order the text comes in if you're just
doing general "Google like" search? When I go to Google and type "grid
computing papers", do I care that "grid computing" comes before some
text within the research paper? Possibly, but mainly I care that "grid
computing" was an emphasized phrase within the text. Now, your
definition of "emphasized" may not just be that it's the first text that
appears in the paper, in the title say: you may just care that the
frequency of "grid computing" in the paper is higher than a certain
threshold relative to other terms. On the other hand, the fact
that "grid computing" is in the title and comes first in the PDF may
mean a lot to you. That's the nature of trying to extract structure
out of inherently unstructured content. I'm not saying that the
structure or order of text within a document is never useful: I agree
that in a lot of cases, it can help you to infer what values are
associated with what fields you want to index, etc. All I'm saying is
that it's certainly a subset of the greater functionality of just doing
free text search, so you shouldn't generalize and say that you can't
parse a PDF sequentially and obtain good results.
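
A small sketch of the distinction being drawn here, in Python.  This is
not Lucene or Nutch code; the function names and sample sentences are
made up for illustration.  It shows that per-term frequency is the same
no matter what order the words are extracted in, while anything that
depends on word adjacency, such as matching "grid computing" as a
phrase, does depend on the order.

```python
from collections import Counter


def term_counts(tokens):
    """Count how often each lowercased term occurs; order plays no role."""
    return Counter(t.lower() for t in tokens)


def contains_phrase(tokens, phrase):
    """Return True if the words of phrase occur consecutively in tokens."""
    words = [t.lower() for t in tokens]
    target = [p.lower() for p in phrase]
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


# The same words, once in reading order and once as a parser might emit
# them if it extracted the text out of sequence (made-up example).
IN_ORDER = "Grid computing papers describe grid computing systems".split()
SCRAMBLED = "computing systems papers Grid describe computing grid".split()

PHRASE = ["grid", "computing"]

print(term_counts(IN_ORDER) == term_counts(SCRAMBLED))  # frequencies agree
print(contains_phrase(IN_ORDER, PHRASE))                # phrase found
print(contains_phrase(SCRAMBLED, PHRASE))               # phrase lost
```

So a frequency-based score survives out-of-order extraction, but a
phrase or proximity match on the same text may not.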

> but its use is second most
> ubiquitous.  PDFBox is not a PDF parsing framework either.  It has
> some PDF parsing algorithms that aren't being used.  Google does a
> good job parsing PDF; Nutch has to as well if it's going to compete.

Can you show that Google's PDF parsing capability is any better than
Nutch's using accepted evaluation methods for PDF? How about some real
use cases and real results? Until we can see such numbers, I'm
hesitant to believe what you're saying is true. If it is though, then
I'm sure that the community would welcome any updates to the PDF parsing
plugin that expedite its improvement.

Cheers,
  Chris



> 
> 
> 
> 
> -----Original Message-----
> From: Chris Mattmann [mailto:[EMAIL PROTECTED]
> Sent: Saturday, March 04, 2006 4:10 PM
> To: [email protected]
> Subject: Re: project vitality?
> 
> 
> Hello,
> 
>  I've been following this conversation for the past week and decided 
> that I'd go ahead and chime in now. I think that honestly this whole 
> thread of discussion needs to be taken off list, because it doesn't 
> really have anything to do with the "use" of Nutch: what it boils down
> to is a list of complaints, requests for improvements and what not.
> Nutch's goal is to be a large-scale, open source search engine: it's
> not a PDF parsing framework, nor is it as thoroughly documented as
> some commercial software -- although I've run into many commercial
> software products that don't have the same quality of documentation
> that Nutch even has now in its nascent stages.
> 
>> Now that I have said that, I want to express my feeling that it's hard
>> when it takes a week to figure out that invertlinks only applies to
>> version 0.8, and when you ask to become a volunteer, you are met with
>> no response.
> 
> You don't need to "ask" to become a volunteer: just do it. As Doug
> said, create a patch, submit the patch to JIRA and let the community
> look at it. Change something on the Wiki if you don't think that the
> documentation is particularly good there. Use Nutch to do whatever you
> like, and if you feel that you contributed something that is
> applicable to a broader community outside of your domain, let people
> know about it. If it's really cool, I wouldn't worry about people
> ignoring you: they'll come around.
> 
>> It's also frustrating when you share some hard-earned insights into
>> something that nutch needs to work on, like pdf parsing, and your 
>> comments don't get a single good response from the nutch dev team.
> 
> The nutch "dev team" isn't focused on PDF parsing. Nutch is a search 
> engine framework, and to Nutch, a PDF parser is a "black box" that 
> conforms to a standard parsing interface that can be swapped out as 
> technology evolves. Right now, Nutch uses PDFBox, but in a week it 
> could use "hot super new rad PDF parsing technology X.1", or some 
> other greater PDF parser. If you feel that PDFBox isn't getting the 
> job done for your particular domain, then post an actual question, not
> pointers to documents for the Nutch developers to go read. Honestly,
> I'm guessing they don't have the time, nor the desire to go read a
> whole bunch of PDF documentation unless there's a real use case, and a
> real need to upgrade the existing parser. Empirically show that
> Nutch's PDF capabilities aren't getting the job done, post your
> results to the list, and let the community look at them. I'd guess you'd
> generate more interest and probably get a better response that way.
> 
>> 
>> Sometimes, in OS projects I get the feeling that the developers 
>> breathe different air than users, and that our help is not wanted or 
>> that our questions are stupid and not worth their time to answer.
> 
> As far as I can tell the Nutch developers all breathe the same air as 
> us (and moreover, I believe they put on their pants "one leg at a 
> time")
> 
>> 
>> Nutch is nowhere near being a dead project, that is not what I said
>> (I said it was close, not closed), it's just that I don't feel that it's
>> something that anyone can just download and use without running into
>> problems.
> 
> Problems is a generic word: I would agree with your statement if you 
> qualified what "problems" means. Small problems like configuration 
> issues? I'd buy that. Exception messages not providing super super 
> detailed information about the error? Sure, I'd even buy that in some 
> cases. However, larger, bigger problems that generally fall in the 
> class of "bugs"? I would say the answer to that is probably a "no".
> 
>> Problems always exist, but need to be documented correctly so that
>> they can be solved quickly.  I think nutch has a long way to go before
>> it is comparable to tomcat or httpd, which are both production ready
>> and have literally volumes of information on using in every manner
>> possible.
> 
> Check out the committers list on Tomcat (
> http://tomcat.apache.org/whoweare.html) versus that of Nutch (
> http://lucene.apache.org/nutch/credits.html). 21 active committers on
> the Tomcat PMC and many more emeritus committers. Nutch has fewer than
> 10. To have the wealth of capability and functionality that Nutch
> provides, with the ability to deploy it in production quality
> environments (which I can assure you, after having been on the mailing
> lists for the better part of a year, there are plenty), and its ease
> of use, I would have to respectfully disagree with the majority of
> your assertions and say that the Nutch folks are doing a great job.
> 
> Now, can we please take this discussion off the public mailing lists? 
> I would think that the majority of folks on the list would like to 
> move on. I know that I would.
> 
> Cheers,
>   Chris
> 
> 
>> 
>> I am sorry if you don't like my opinion or the way it is expressed.
>> 
>> -----Original Message-----
>> From: carmmello [mailto:[EMAIL PROTECTED]
>> Sent: Saturday, March 04, 2006 10:54 AM
>> To: [email protected]
>> Subject: RE: project vitality?
>> 
>> 
>> I really cannot agree with the way Mr. Richard Braman expresses his
>> views.  I have tried Nutch since version 0.3 and I could not make the
>> 0.8 release work (Nutch is becoming a little bit complicated with all
>> those map reduce, hadoop, and so on, that I can't deal with).  I
>> understand, however, that if a product is not finished yet, sometimes
>> it may fail with the lack of some fundamental documentation,
>> but, if there is a bunch of people who develop, for free, a product
>> that is commercially worth some thousands of dollars and may fit our
>> purposes, we have to say thanks.  After that we can, of course,
>> express our views, complaints and suggestions, but we should refrain
>> from some hard, non-relevant comments that go nowhere, like this
>> non-technical post of mine. I, myself, have my own experimental
>> implementation of Nutch 0.7.1.x (a nightly version), with more than
>> 400,000 pages, that can be, sometimes, viewed at Brazilian working
>> hours, at http://www.qualidade.eng.br/constelacao.htm .  It is in
>> Portuguese, but English terms related to quality, standards and
>> environment can be searched.
>> 
>
