RE: Text Extraction with multi-column documents in PDFBox

Martinez, Mel - 1004 - MITLL Thu, 31 Mar 2011 08:58:39 -0700

Ted,

A lot depends on how the PDF file was generated, but in general, as long as you
leave the 'sort by position' attribute of PDFBox's PDFTextStripper set to
'false' (the default), the text extraction will be (mostly) logical rather than
positional.

        PDFTextStripper myStripper = ...
        // Not actually necessary, since false is the default.
        myStripper.setSortByPosition(false);

That is, if you have text in two columns on a page, the lines will be extracted
article by article and will not run across columns.
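
To make the difference concrete, here is a toy model in plain Java. It does not
call PDFBox at all: the Fragment class and both method names are made up for
illustration, and the real stripper's positional sort is more involved than the
simple top-to-bottom, left-to-right comparison assumed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model only: none of these names come from PDFBox.
final class ColumnOrderDemo {

    /** A line of text with the top-left corner of its box (y grows downward). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** "Logical" order: keep fragments in the order the content stream supplied them. */
    static String inStreamOrder(List<Fragment> fragments) {
        return join(fragments);
    }

    /** "Positional" order: top to bottom, then left to right, ignoring columns. */
    static String sortedByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.<Fragment>comparingDouble(f -> f.y)
                .thenComparingDouble(f -> f.x));
        return join(sorted);
    }

    private static String join(List<Fragment> fragments) {
        StringBuilder out = new StringBuilder();
        for (Fragment f : fragments) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(f.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two columns, written to the stream one column (article) at a time.
        List<Fragment> page = Arrays.asList(
                new Fragment("A1", 50, 100), new Fragment("A2", 50, 120),
                new Fragment("B1", 300, 100), new Fragment("B2", 300, 120));
        System.out.println(inStreamOrder(page));    // A1 A2 B1 B2
        System.out.println(sortedByPosition(page)); // A1 B1 A2 B2
    }
}
```

The second line of output shows the cross-column interleaving that you avoid by
leaving the positional sort off, provided the file was written column by column.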

SOME PDFs can be (and unfortunately are) generated such that the text objects 
are not logically arranged by article and the extraction still messes up.  But 
in my experience on most documents it does a pretty good job, especially those 
generated from word processors.

The only recurring glitches tend to be where text in headers and footers gets 
inserted and sometimes a floating text box will be inserted in the extracted 
text quite far from where it appears on the page.  But the block of text from 
the box usually will at least be integral and not chopped up.

The time when you may WANT to sort by position is when parsing text from PDFs
that are more graphical in nature, such as those generated from PowerPoint-type
documents.  Even then, though, it depends a lot on how the page is structured.
A bit of testing is usually necessary to figure out which setting works best
with the particular PDF.

As of 1.4 we have a lot of instrumentation that allows you to override / 
customize the demarcation between the following structural points:

Page
Article
Paragraph
Line
Word

All you have to do is apply the demarcations that you would prefer using the
setters or, for more complex cases, subclass the stripper and override the
behavior of the getters for the start/stop demarcations.
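
As a rough picture of how nested start/stop demarcations turn plain text into
tagged output, here is a standalone model in plain Java. It does not use
PDFBox, and the class and field names are hypothetical. In PDFBox itself the
corresponding setters are, if I recall the names correctly, setPageStart /
setPageEnd, setArticleStart / setArticleEnd, setParagraphStart /
setParagraphEnd, setLineSeparator and setWordSeparator; check the Javadoc for
your version before relying on them.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Standalone model of nested demarcations. It does not call PDFBox, and every
// name here is hypothetical.
final class DemarcationDemo {
    String pageStart = "";
    String pageEnd = "";
    String articleStart = "";
    String articleEnd = "";
    String paragraphStart = "";
    String paragraphEnd = "";
    String lineSeparator = "\n";
    String wordSeparator = " ";

    /**
     * Renders one page. The page is a list of articles, an article is a list of
     * paragraphs, a paragraph is a list of lines, and a line is a list of words.
     * No XML escaping is done, so real text would need that added.
     */
    String render(List<List<List<List<String>>>> page) {
        StringBuilder out = new StringBuilder(pageStart);
        for (List<List<List<String>>> article : page) {
            out.append(articleStart);
            for (List<List<String>> paragraph : article) {
                out.append(paragraphStart);
                for (int i = 0; i < paragraph.size(); i++) {
                    if (i > 0) {
                        out.append(lineSeparator);
                    }
                    out.append(String.join(wordSeparator, paragraph.get(i)));
                }
                out.append(paragraphEnd);
            }
            out.append(articleEnd);
        }
        return out.append(pageEnd).toString();
    }

    public static void main(String[] args) {
        List<List<String>> paragraph = Arrays.asList(
                Arrays.asList("Hello", "world"),
                Arrays.asList("second", "line"));
        List<List<List<String>>> article = Collections.singletonList(paragraph);
        List<List<List<List<String>>>> page = Collections.singletonList(article);

        DemarcationDemo demo = new DemarcationDemo();
        demo.pageStart = "<page>";
        demo.pageEnd = "</page>";
        demo.articleStart = "<article>";
        demo.articleEnd = "</article>";
        demo.paragraphStart = "<p>";
        demo.paragraphEnd = "</p>";
        System.out.println(demo.render(page));
    }
}
```

With the defaults left empty the same call produces plain text, which is why
setting only the markers you care about is usually enough.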

In my own usage I have used this to extract text into a simple XML format with
the above tags, and this has been applied to thousands of documents from a
variety of sources.  For the most part, this works pretty well.

Good luck,

Mel

-----Original Message-----
From: Ted Dunning [mailto:[email protected]] 
Sent: Wednesday, March 30, 2011 1:05 PM
To: [email protected]; [email protected]
Subject: Fwd: Text Extraction with multi-column documents in PDFBox

---------- Forwarded message ----------
From: Ted Dunning <[email protected]>
Date: Wed, Mar 30, 2011 at 10:04 AM
Subject: Re: Text Extraction with multi-column documents in PDFBox
To: Jeremy Barkan <[email protected]>

I haven't looked at that lately so I may be a bit wrong on details, but if
you look at the sample article that I posted, you can see how simply
following any heuristic for generating the flow based on position alone will
not work.  The text inset on the first page, for instance, will get the
columns all confused.  The current heuristics are probably fine for finding
individual lines, but not for splitting lines into columns and then
threading those lines into correct flows and marking those flows as text or
decoration.  Moreover, there are important cues given by font and size that
need to be used.  One such cue is whether the text is in the majority font.
This alone is enough to separate about 90% of the main flow of the document
from other parts of the document (for the journals I examined).  Most of
the remaining 10% can be had from considering geometrical cues in the
context of that initial assignment, but without the original assignment
based on fonts, the geometry isn't really strong enough.
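
A minimal sketch of that first, font-based pass, in plain Java with no PDFBox
dependency. The Run class, the "font@size" key and the choice to weight each
font by character count are all assumptions made for illustration; this is not
the code I actually ran, and it leaves out the geometric refinement entirely.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: names and the character-count weighting are assumptions.
final class MajorityFontDemo {

    /** A run of text together with the font name and size it was drawn in. */
    static final class Run {
        final String text;
        final String font;
        final float size;

        Run(String text, String font, float size) {
            this.text = text;
            this.font = font;
            this.size = size;
        }

        String key() {
            return font + "@" + size;
        }
    }

    /** Returns the font/size key covering the most characters, or null if there are no runs. */
    static String majorityFontKey(List<Run> runs) {
        Map<String, Integer> charsPerFont = new LinkedHashMap<>();
        for (Run run : runs) {
            charsPerFont.merge(run.key(), run.text.length(), Integer::sum);
        }
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> entry : charsPerFont.entrySet()) {
            // Strict comparison: on a tie the font seen first wins.
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    /** Keeps only the runs set in the majority font, as a first guess at the main flow. */
    static List<Run> mainFlow(List<Run> runs) {
        String majority = majorityFontKey(runs);
        List<Run> flow = new ArrayList<>();
        for (Run run : runs) {
            if (run.key().equals(majority)) {
                flow.add(run);
            }
        }
        return flow;
    }
}
```

Everything the filter rejects (headers, footers, insets, captions) is what the
geometric cues would then have to sort out in a second pass.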

I think that there is more to be done with what I started in that you can
look at how things came out from the first pass and use statistics
describing positions on the page and font/size/position transitions within a
single text type to refine the statistical model of the document.  That
would allow the flow to be recalculated, hopefully handling a few corner
cases more accurately.

My original goal was to simply remove the boiler-plate from the document and
leave a residue that would allow a high quality retrieval index to be
created.  The final results were nearly good enough to present as a
simplified, text-only surrogate for the document, but not quite.  They were
certainly quite readable, but not very pretty.

On Wed, Mar 30, 2011 at 9:54 AM, Jeremy Barkan <[email protected]> wrote:

> How is what you describe similar or different than the charactersByArticle
> method of PDFTextStripper ?
>
>
>
> Thanks so much for your help
>
>
>
> Best Regards
>
>
>
> Jeremy
>
> *From:* Ted Dunning [mailto:[email protected]]
> *Sent:* 30 March 2011 17:55
> *To:* Jeremy Barkan
> *Subject:* Re: Text Extraction with multi-column documents in PDFBox
>
>
>
> Neither.
>
>
>
> Never.
>
>
>
> It would be very helpful to have it, though.
>
> On Wed, Mar 30, 2011 at 8:52 AM, Jeremy Barkan <[email protected]>
> wrote:
>
> Thanks for getting back to me – I was looking into this kind of algorithm.
>
> Was this merged into PDFBox 1.4 or 1.5 ?
>
> I'm trying to decide whether to implement this on my own on top of PDFBox or
> to use what PDFBox has already implemented.
>
>
>
