[jira] Commented: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Ted Dunning (JIRA) Mon, 26 Jul 2010 11:22:42 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892402#action_12892402
 ]


Ted Dunning commented on PDFBOX-521:
------------------------------------

{quote}
3. I had to change the normalize method, as it didn't work for rtl text. 
The old implementation asked every TextPosition, if the logical order has to be 
changed. hello3.pdf from our test arena consists of 3 words "Hello محمد 
World.". There is one TextPosition for (nearly) every character. As it isn't 
possible to change the order of just one character, we have to combine the 
characters to words. Those can be reordered and everything works fine. 

Any comments, further suggestions? 
{quote}

This is a great improvement, but what happens with two-column text?

In my experience, aggregating characters into words helps when trying to 
derive text in a sensible reading flow, but aggregating into lines while 
keeping columns separate is even better.

Take, for example, the first page of this document: 
http://www.bioone.org/doi/pdf/10.1525/auk.2009.11009

Several issues come up here when trying to extract a readable flow, and nearly 
all of them can be handled by aggregating text into line-like units. The 
approach I took (in code that is unfortunately not distributable) was to:

a) separate into line units according to vertical spacing, horizontal spacing 
and font size. This is straightforward and serves to separate all of the text 
blocks that need separating.

b) build a weighted DAG of adjacent line units based on horizontal overlap, 
font-size equality and vertical spacing. To use vertical spacing cues well, I 
re-weighted vertical spacing according to how common that spacing is in the 
document. With these factors, normal text lines score as very close together, 
since they have high overlap, the same font and a very common spacing. This 
DAG can then be made sparse so that each line has at most one successor. The 
output of this stage is a set of text blocks, each of which is internally 
coherent as a reading unit.

c) thread the text blocks. Even something as simple as top-to-bottom, then 
left-to-right ordering works well. The result is a highly readable flow in 
which text that appears to be exceptional, such as footnotes, titles and page 
numbers, is clearly marked off.
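
A rough sketch of steps a) to c) may make this more concrete. It is written 
from scratch in Python for illustration; it is not the undistributable code 
mentioned above and it does not use any PDFBox API. The names (Box, to_lines, 
to_blocks, thread, extract), the thresholds and the scoring formula are all 
assumptions. As a simplification, the only edge considered for each line is 
the nearest horizontally overlapping line below it, so the graph is sparse 
from the start rather than pruned afterwards.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Box:
    """A word or line unit; y is the baseline, growing down the page."""
    text: str
    x0: float
    x1: float
    y: float
    size: float  # font size


def to_lines(words, max_gap=1.0):
    """(a) Merge words that share baseline and font size and sit close together.

    max_gap is the largest allowed horizontal gap, in multiples of font size.
    """
    lines = []
    for w in sorted(words, key=lambda w: (w.y, w.x0)):
        last = lines[-1] if lines else None
        if (last is not None and last.size == w.size
                and abs(last.y - w.y) < 0.3 * w.size
                and w.x0 - last.x1 <= max_gap * w.size):
            lines[-1] = Box(last.text + " " + w.text, last.x0, w.x1, last.y, w.size)
        else:
            lines.append(Box(w.text, w.x0, w.x1, w.y, w.size))
    return lines


def overlap(a, b):
    """Shared horizontal extent as a fraction of the narrower box."""
    shared = min(a.x1, b.x1) - max(a.x0, b.x0)
    narrow = min(a.x1 - a.x0, b.x1 - b.x0)
    return max(0.0, shared) / narrow if narrow > 0 else 0.0


def to_blocks(lines, min_score=0.5):
    """(b) Link each line to at most one successor and return the chains."""
    # Candidate edge: the nearest line below that overlaps horizontally.
    nearest = {}
    for i, a in enumerate(lines):
        below = [j for j, b in enumerate(lines) if b.y > a.y and overlap(a, b) > 0]
        if below:
            nearest[i] = min(below, key=lambda j: lines[j].y)

    def gap(i, j):
        return round(lines[j].y - lines[i].y, 1)

    # Re-weight vertical spacing by how common that spacing is in the document.
    common = Counter(gap(i, j) for i, j in nearest.items())
    top = max(common.values(), default=1)

    best = {}  # successor index -> (score, predecessor index)
    for i, j in nearest.items():
        a, b = lines[i], lines[j]
        same_font = 1.0 if a.size == b.size else 0.3
        score = overlap(a, b) * same_font * common[gap(i, j)] / top
        if score >= min_score and score > best.get(j, (0.0, None))[0]:
            best[j] = (score, i)

    succ = {i: j for j, (_, i) in best.items()}
    blocks = []
    for head in range(len(lines)):
        if head in best:  # has a predecessor, so it is not the start of a block
            continue
        chain = [head]
        while chain[-1] in succ:
            chain.append(succ[chain[-1]])
        blocks.append([lines[k] for k in chain])
    return blocks


def thread(blocks):
    """(c) Order blocks top-to-bottom, then left-to-right, by their first line."""
    return sorted(blocks, key=lambda blk: (blk[0].y, blk[0].x0))


def extract(words):
    """Run (a), (b), (c) and return one string per text block."""
    blocks = thread(to_blocks(to_lines(words)))
    return [" ".join(line.text for line in blk) for blk in blocks]
```

On a two-column page, the horizontal-overlap test keeps each column in its own 
chain, and a title or page number ends up as a single-line block because its 
font size or spacing is uncommon. Real input would need tolerance on baselines 
and font sizes rather than the exact comparisons used here.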


> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
>                 Key: PDFBOX-521
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-521
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: all
>            Reporter: Mel Martinez
>            Assignee: Andreas Lehmkühler
>         Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is 
> to ignore paragraph demarcation in the text.  It basically just renders each 
> line of text as it discovers it, separating each line equally with the same 
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops 
> in the extracted text.  This is often necessary for text processing that 
> needs to work with logical 'chunks' of text.  Further, rendering into other 
> formats (such as HTML or XML) is facilitated by resolving the document into 
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete 
> instrumentation of the parsing, allowing one to identify / tag paragraph 
> starts and stops.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
