[
https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880885#action_12880885
]
Ted Dunning commented on PDFBOX-521:
------------------------------------
I have done a fair bit of experimentation with this sort of logical extraction
and have found a few salient results (at least for technical documents):
- the font in which the majority of characters are rendered is almost certainly
the body font.
- line segments can readily be detected, even in two-column flows
- grouping line segments is not possible using context-independent rules, but
combining simple features such as is-body-font, the spacing to the previous and
next vertically aligned line segments, and a bit of font information for the
previous and next line segments gives a good approximation of
segment-to-segment reading flow. I did find it useful to train a simple
classifier here.
- separating two-column flows from side-bars is pretty reliable using the flow
model above
- headers and footers are easily discerned by their specialized fonts and by
the fact that they repeat almost verbatim from page to page
- footnotes are reliably detected by font distinctions
- document title material is also well detected.
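The first observation above (the font that renders the most characters is
almost certainly the body font) can be illustrated with a minimal sketch. This
is not the experimental code behind these results: the TextRun class and the
"name/size" font labels are hypothetical stand-ins for whatever a text
extractor reports per run of characters, and no PDFBox API is used.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BodyFontSketch {

    /** Hypothetical container: one extracted run of text and the font it was drawn in. */
    static final class TextRun {
        final String fontName;
        final String text;

        TextRun(String fontName, String text) {
            this.fontName = fontName;
            this.text = text;
        }
    }

    /** Returns the font that renders the most characters, or null if there are no runs. */
    static String findBodyFont(List<TextRun> runs) {
        // Tally the number of characters rendered in each font.
        Map<String, Integer> charsPerFont = new HashMap<>();
        for (TextRun run : runs) {
            charsPerFont.merge(run.fontName, run.text.length(), Integer::sum);
        }
        // Pick the font with the largest tally.
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> entry : charsPerFont.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```

The result could then serve as the is-body-font feature for each line segment;
how fonts are named and whether size is part of the label is an assumption of
this sketch.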
I don't have code that I can share just now, but may have some later this
month. I would be happy to kibitz if somebody else is doing the work.
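For anyone picking up the work, the repetition cue for headers and footers
mentioned above could be sketched roughly as follows. This covers only the
"repeats almost verbatim from page to page" part, not the font cue. The class
name, the digit-masking normalization (so that "Page 1" and "Page 2" compare
equal), and the minFraction threshold are all assumptions made for
illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class RepeatedLineSketch {

    /** Masks digit runs and case so that lines differing only in page numbers match. */
    static String normalize(String line) {
        return line.trim().toLowerCase(Locale.ROOT).replaceAll("\\d+", "#");
    }

    /**
     * Returns the normalized lines that occur on at least minFraction of the pages.
     * Each inner list holds the text lines of one page (hypothetical input shape).
     */
    static Set<String> findRepeatedLines(List<List<String>> pages, double minFraction) {
        // Count, for each normalized line, the number of pages it appears on.
        Map<String, Integer> pagesContaining = new HashMap<>();
        for (List<String> page : pages) {
            Set<String> seenOnPage = new HashSet<>();
            for (String line : page) {
                String key = normalize(line);
                // Count a line at most once per page.
                if (!key.isEmpty() && seenOnPage.add(key)) {
                    pagesContaining.merge(key, 1, Integer::sum);
                }
            }
        }
        // Keep lines that repeat on enough pages to look like headers or footers.
        Set<String> repeated = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : pagesContaining.entrySet()) {
            if (entry.getValue() >= minFraction * pages.size()) {
                repeated.add(entry.getKey());
            }
        }
        return repeated;
    }
}
```

In practice this candidate set would presumably be combined with the font
feature, since a short body line can also repeat across pages by coincidence.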
> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
> Key: PDFBOX-521
> URL: https://issues.apache.org/jira/browse/PDFBOX-521
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing
> Affects Versions: 0.8.0-incubator
> Environment: all
> Reporter: Mel Martinez
> Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is
> to ignore paragraph demarcation in the text. It basically just renders each
> line of text as it discovers it, separating each line equally with the same
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops
> in the extracted text. This is often necessary for text processing that
> needs to work with logical 'chunks' of text. Further, rendering into other
> formats (such as HTML or XML) is facilitated by resolving the document into
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete
> instrumentation of the parsing, allowing one to identify / tag paragraph
> starts and stops.