Well, ultimately this is going to be difficult because PDF is not a logical data store. It is a rendering state engine. SOMEtimes the data objects in it are fortuitously arranged to fit a desired logical structure, but there is no guarantee of that.
If you have some foreknowledge about the structure of a given corpus of documents, you may be able to write some custom code that figures things out, but otherwise PDF in general is simply not designed for that purpose. In the documents I've been extracting, hyphen breaks at end of line seem to be preserved, and it seems like it would be straightforward to detect those and do the reconstruction. However, the devil is in the details and your documents may not be as cooperative.

Good luck!

From: Ted Dunning [mailto:[email protected]]
Sent: Thursday, March 31, 2011 1:28 PM
To: [email protected]
Cc: Martinez, Mel - 1004 - MITLL; [email protected]; [email protected]
Subject: Re: Text Extraction with multi-column documents in PDFBox

Yes. This use of the native flow works about 50-80% of the time in my experience. But it was way too error prone to depend on and failed spectacularly for many critical data sources. Even where it worked, the results were often not good enough. For one thing, I needed real text flow so that I could reliably reverse engineer hyphenation (for text indexing). I also needed to reliably remove headers, footers, page numbers, article titles and similar boilerplate across thousands of document sources without hand engineering each kind of document.

On Thu, Mar 31, 2011 at 8:58 AM, Martinez, Mel - 1004 - MITLL <[email protected]> wrote:

> Ted,
>
> A lot depends on how the PDF file was generated, but in general, so long as you leave the 'sort by position' attribute of PDFBox's PDFTextStripper as 'false' (the default), the text extraction will be (mostly) logical and not positional.
>
>     PDFTextStripper myStripper = ...
>     myStripper.setSortByPosition(false); // not actually necessary since false is the default
>
> That is, if you have text in two columns on a page, the lines will be extracted by article and not across columns.

Sort of. As I mentioned, the quality across a bunch of data sources was just not good enough to even contemplate deployment. Moreover, there was no way forward to improve the situation.

> SOME PDFs can be (and unfortunately are) generated such that the text objects are not logically arranged by article, and the extraction still messes up. But in my experience it does a pretty good job on most documents, especially those generated from word processors.

I was working against documents from publishers. My results were much worse than what you have seen, it sounds like.

> The only recurring glitches tend to be where text in headers and footers gets inserted, and sometimes a floating text box will be inserted in the extracted text quite far from where it appears on the page. But the block of text from the box usually will at least be integral and not chopped up.

Only sometimes. The rearrangements in practice are quite capricious.

> The times when you may WANT to sort by position are when parsing text from PDFs that are more graphical in nature, such as those generated from PowerPoint-type documents. Even then, it depends a lot on how the page is structured. A bit of testing is usually necessary to figure out which setting works best with a particular PDF.

And my requirement was that I could not accept any magical knob turning. My solution had to work across a huge range of sources.

> As of 1.4 we have a lot of instrumentation that allows you to override / customize the demarcation between the following structural points:
>
> Page
> Article
> Paragraph
> Line
> Word

That just doesn't really help. I needed auto-tuning, line unbreaking and real flow following.
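For reference, a minimal sketch of the logical-order extraction Mel describes, including the 1.4-era separator hooks mentioned above. It assumes the PDFBox 1.x package layout (org.apache.pdfbox.util.PDFTextStripper; in 2.x the class moved to org.apache.pdfbox.text), and "example.pdf" is only a placeholder file name:

    import java.io.File;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class LogicalOrderExtract {
        public static void main(String[] args) throws Exception {
            PDDocument doc = PDDocument.load(new File("example.pdf")); // placeholder
            try {
                PDFTextStripper stripper = new PDFTextStripper();
                // false is the default: follow the content stream ("logical") order
                stripper.setSortByPosition(false);
                // The hooks for customizing structural demarcation:
                stripper.setWordSeparator(" ");
                stripper.setLineSeparator("\n");
                stripper.setParagraphStart("");
                stripper.setParagraphEnd("\n\n");
                System.out.println(stripper.getText(doc));
            } finally {
                doc.close();
            }
        }
    }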
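And a rough sketch of the hyphen-break reconstruction alluded to in both messages, on the naive assumption that any line ending in a hyphen is a soft break; a real corpus would need dictionary checks to keep legitimate hyphenated compounds intact:

    // Naively rejoin end-of-line hyphen breaks in already-extracted text.
    public class Dehyphenate {
        public static String dehyphenate(String extracted) {
            StringBuilder out = new StringBuilder();
            String[] lines = extracted.split("\n", -1);
            for (int i = 0; i < lines.length; i++) {
                String line = lines[i];
                if (line.endsWith("-") && i + 1 < lines.length) {
                    // Drop the hyphen and the line break so the word is rejoined.
                    out.append(line, 0, line.length() - 1);
                } else {
                    out.append(line);
                    if (i + 1 < lines.length) {
                        out.append('\n');
                    }
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            // Prints: reconstruction of hyphenated words
            System.out.println(dehyphenate("recon-\nstruction of hyphen-\nated words"));
        }
    }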
