[jira] Commented: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Ted Dunning (JIRA) Mon, 21 Jun 2010 09:32:50 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880885#action_12880885
 ]


Ted Dunning commented on PDFBOX-521:
------------------------------------


I have done a fair bit of experimentation with this sort of logical extraction 
and have found a few salient results (at least for technical documents):

- the font in which the majority of characters are rendered is almost certainly 
the body font.

- line segments can readily be detected, even in two-column flows.

- grouping line segments is not possible using context-independent rules, but 
combining simple features such as is-body-font, the spacing to the previous and 
next vertically aligned line segments, and a bit of font information for the 
previous and next line segments gives a good approximation of 
segment-to-segment reading flow.  I did find it useful to train a simple 
classifier here.

- separating two-column flows from sidebars is pretty reliable using the flow 
model above.

- headers and footers are easily discerned by their specialized fonts and the 
fact that they repeat almost verbatim from page to page.

- footnotes are reliably detected by font distinctions.

- document title material is also well detected.
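
As an illustration of the first point in the list (this is not the 
commenter's code), the body font can be guessed by tallying characters per 
font.  The sketch below assumes the text has already been extracted into 
(text, font_name, font_size) tuples; that tuple layout and the name 
find_body_font are hypothetical and are not part of the PDFBox API.

```python
from collections import Counter


def find_body_font(runs):
    """Return the (font_name, font_size) pair that renders the most characters.

    `runs` is an iterable of (text, font_name, font_size) tuples. This input
    shape is an assumption made for the sketch, not a PDFBox structure.
    """
    counts = Counter()
    for text, font_name, font_size in runs:
        # Count visible characters only, so whitespace does not skew the tally.
        visible = sum(1 for ch in text if not ch.isspace())
        # Round the size so tiny floating-point differences share one bucket.
        counts[(font_name, round(font_size, 1))] += visible
    if not counts:
        return None
    return counts.most_common(1)[0][0]


sample = [
    ("Logical Extraction", "Times-Bold", 14.0),
    ("The body text of a technical document ", "Times-Roman", 10.0),
    ("usually dominates the character count.", "Times-Roman", 10.0),
    ("1 A footnote.", "Times-Roman", 8.0),
]
print(find_body_font(sample))  # ('Times-Roman', 10.0)
```

Keying on both name and size is a design choice of this sketch: footnotes 
often reuse the body typeface at a smaller size, so the name alone would not 
separate them.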

I don't have code that I can share just now, but may have some later this 
month.  I would be happy to kibitz if somebody else is doing the work.
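
The header and footer observation in the list above (near-verbatim 
repetition from page to page) can be sketched without any font information.  
This is a minimal illustration under stated assumptions, not the commenter's 
method: each page is assumed to be a plain list of line strings, digit runs 
are masked so that page numbers compare equal, and the names normalize, 
repeated_lines, strip_repeated and the 0.6 threshold are all hypothetical.

```python
import re
from collections import Counter


def normalize(line):
    # Mask digit runs so "Page 3" and "Page 4" compare equal.
    return re.sub(r"\d+", "#", line.strip().lower())


def repeated_lines(pages, min_fraction=0.6):
    """Return normalized lines that occur on at least min_fraction of pages.

    `pages` is a list of pages, each a list of line strings (assumed shape).
    """
    seen = Counter()
    for lines in pages:
        # Use a set so a line repeated within one page counts once per page.
        seen.update({normalize(l) for l in lines if l.strip()})
    threshold = min_fraction * len(pages)
    return {line for line, n in seen.items() if n >= threshold}


def strip_repeated(pages, min_fraction=0.6):
    """Drop lines judged to be running headers or footers from every page."""
    repeated = repeated_lines(pages, min_fraction)
    return [[l for l in lines if normalize(l) not in repeated] for lines in pages]


pages = [
    ["Journal of Examples 12(3)", "Intro text here.", "Page 1"],
    ["Journal of Examples 12(3)", "More text.", "Page 2"],
    ["Journal of Examples 12(3)", "Final text.", "Page 3"],
]
print(strip_repeated(pages))
# [['Intro text here.'], ['More text.'], ['Final text.']]
```

A fuller version would presumably also use the font cue mentioned in the 
list and restrict candidates to the top and bottom of each page, since body 
lines that happen to repeat would otherwise be dropped as well.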

> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
>                 Key: PDFBOX-521
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-521
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: all
>            Reporter: Mel Martinez
>         Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is 
> to ignore paragraph demarcation in the text.  It basically just renders each 
> line of text as it discovers it, separating each line equally with the same 
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops 
> in the extracted text.  This is often necessary for text processing that 
> needs to work with logical 'chunks' of text.  Further, rendering into other 
> formats (such as HTML or XML) is facilitated by resolving the document into 
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete 
> instrumentation of the parsing, allowing one to identify / tag paragraph 
> starts and stops.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
