Well, ultimately this is going to be difficult because PDF is not a logical data 
store.  It is a rendering state engine.  Sometimes the data objects in it are 
fortuitously arranged to fit a desired logical structure, but there is no 
guarantee of that.

 

If you have some foreknowledge about the structure of a given corpus of 
documents, you may be able to write some custom code that figures things out, 
but otherwise, PDF in general is simply not designed for that purpose.

 

In the documents I’ve been extracting, hyphen-breaks at the ends of lines seem 
to be preserved, and it seems like it would be straightforward to detect those 
and reconstruct the original words.  However, the devil is in the details, and 
your documents may not be as cooperative.
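
For what it's worth, a minimal sketch of that kind of reconstruction (not 
PDFBox code, just a naive heuristic over already-extracted lines, and it 
assumes the trailing hyphen really does survive extraction) might look like 
this:

    import java.util.List;

    public class Dehyphenate {
        // Naive re-join of words that were split with a hyphen at the end of
        // a line: if a line ends with '-', drop the hyphen and glue the next
        // line on directly.  Note this also "repairs" legitimately hyphenated
        // words that happen to break at a line end, so real documents need a
        // smarter check (a dictionary lookup, for example).
        public static String rejoin(List<String> lines) {
            StringBuilder out = new StringBuilder();
            String pending = "";
            for (String line : lines) {
                String text = pending + line.trim();
                if (text.endsWith("-")) {
                    pending = text.substring(0, text.length() - 1);
                } else {
                    out.append(text).append('\n');
                    pending = "";
                }
            }
            if (!pending.isEmpty()) {
                out.append(pending).append('\n');
            }
            return out.toString();
        }
    }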

 

Good luck!

 

From: Ted Dunning [mailto:[email protected]] 
Sent: Thursday, March 31, 2011 1:28 PM
To: [email protected]
Cc: Martinez, Mel - 1004 - MITLL; [email protected]; 
[email protected]
Subject: Re: Text Extraction with multi-column documents in PDFBox

 

Yes.  This use of the native flow works about 50-80% of the time in my 
experience.  But it was way too error-prone to depend on and failed 
spectacularly for many critical data sources.  Even where it worked, the 
results were often not good enough.  For one thing, I needed real text flow so 
that I could reliably reverse-engineer hyphenation (for text indexing).  I also 
needed to reliably remove headers, footers, page numbers, article titles and 
similar boilerplate across thousands of document sources without 
hand-engineering each kind of document.

On Thu, Mar 31, 2011 at 8:58 AM, Martinez, Mel - 1004 - MITLL 
<[email protected]> wrote:

Ted,

A lot depends on how the PDF file was generated, but in general, so long as you 
leave the 'sort by position' attribute of PDFBox's PDFTextStripper as 'false' 
(the default), then the text extraction will be (mostly) logical and not 
positional.

       PDFTextStripper myStripper = ...
       myStripper.setSortByPosition(false);  // not actually necessary, since false is the default

That is, if you have text in two columns on a page, the lines will be extracted 
by article and will not run across columns.

 

Sort of.  As I mentioned, the quality across a bunch of data sources was just 
not good enough to even contemplate deployment.  Moreover, there was no way 
forward to improve the situation.

 

SOME PDFs can be (and unfortunately are) generated such that the text objects 
are not logically arranged by article and the extraction still messes up.  But 
in my experience on most documents it does a pretty good job, especially those 
generated from word processors.

 

I was working against documents from publishers.  My results were much worse 
than what you have seen, it sounds like.

 

The only recurring glitches tend to be where text in headers and footers gets 
inserted, and sometimes a floating text box will be inserted in the extracted 
text quite far from where it appears on the page.  But the block of text from 
the box will usually at least be integral and not chopped up.

 

Only sometimes.  The rearrangements in practice are quite capricious.

 

The times when you may WANT to sort by position are when parsing text from PDFs 
that are more graphical in nature, such as those generated from PowerPoint-type 
documents.  Even then, though, it depends a lot on how the page is structured.  
A bit of testing is usually necessary to figure out which setting works best 
with the particular PDF.
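
If it helps, that kind of testing can be as simple as extracting the same 
document with both settings and comparing the two results by eye.  A rough 
sketch against the 1.x API of the time (in 2.x the stripper moved to 
org.apache.pdfbox.text; the file name below is just a placeholder):

    import java.io.File;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class CompareExtraction {
        public static void main(String[] args) throws Exception {
            PDDocument doc = PDDocument.load(new File("sample.pdf"));  // placeholder name
            try {
                PDFTextStripper logicalStripper = new PDFTextStripper();
                logicalStripper.setSortByPosition(false);   // logical (content-stream) order
                String logical = logicalStripper.getText(doc);

                PDFTextStripper positionalStripper = new PDFTextStripper();
                positionalStripper.setSortByPosition(true); // positional (top-down, left-to-right)
                String positional = positionalStripper.getText(doc);

                System.out.println("=== logical order ===\n" + logical);
                System.out.println("=== positional order ===\n" + positional);
            } finally {
                doc.close();
            }
        }
    }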

 

And my requirement was that I could not accept any magical knob turning.  My 
solution had to work across a huge range of sources.

 

As of 1.4 we have a lot of instrumentation that allows you to override / 
customize the demarcation between the following structural points:

Page
Article
Paragraph
Line
Word
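
For example, you can set explicit start/end markers and separators for those 
levels on the stripper (or override the corresponding protected methods in a 
subclass).  A rough sketch, again against the 1.x package layout, with 
arbitrary placeholder marker strings:

    import java.io.File;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class MarkedExtraction {
        public static void main(String[] args) throws Exception {
            PDDocument doc = PDDocument.load(new File("sample.pdf"));  // placeholder name
            try {
                PDFTextStripper stripper = new PDFTextStripper();

                // Emit explicit markers at each structural boundary so downstream
                // code can see where pages, articles and paragraphs begin and end.
                stripper.setPageStart("\n<<page>>\n");
                stripper.setArticleStart("\n<<article>>\n");
                stripper.setParagraphStart("<<p>>");
                stripper.setParagraphEnd("<</p>>\n");
                stripper.setLineSeparator("\n");
                stripper.setWordSeparator(" ");

                System.out.println(stripper.getText(doc));
            } finally {
                doc.close();
            }
        }
    }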

 

That just doesn't really help.  I needed auto-tuning, line unbreaking and real 
flow following.

 
