[
https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966215#action_12966215
]
Ted Dunning commented on PDFBOX-521:
------------------------------------
I don't want to derail this commit, but I would love to do a brain dump with
somebody about the (unreleasable) experiments that I did last year on text
flow and paragraph detection across a variety of real documents.
The 30-second summary is that I grouped characters into single-line text chunks
based on proximity and font size similarity. This was relatively easy and gave
pretty high quality lines except in the case of large paragraph-initial
letters (drop caps). Multi-column documents were handled correctly at this
level.
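
A minimal sketch of this kind of grouping follows. It is not the code from
those experiments: the Glyph type, the greedy strategy and every threshold are
assumptions made for illustration, and nothing here is PDFBox API. Glyphs are
visited left to right, and each one is appended to the first chunk whose last
glyph sits on nearly the same baseline, ends close by and has a similar font
size. A wide gap (such as a column gutter) starts a new chunk.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: none of these names come from PDFBox.
final class LineGrouper {
    /** One positioned character: left edge, baseline, advance width, font size. */
    static final class Glyph {
        final String text;
        final double x, baseline, width, fontSize;

        Glyph(String text, double x, double baseline, double width, double fontSize) {
            this.text = text;
            this.x = x;
            this.baseline = baseline;
            this.width = width;
            this.fontSize = fontSize;
        }
    }

    // Illustrative thresholds (assumed, untuned), as multiples of the font size.
    static final double MAX_GAP = 1.0;            // largest horizontal gap inside a line
    static final double MAX_OVERLAP = 0.5;        // tolerated overlap from kerning
    static final double MAX_BASELINE_SHIFT = 0.3; // largest vertical drift inside a line
    static final double MAX_SIZE_RATIO = 1.2;     // largest font size ratio inside a line

    /** Greedily groups glyphs into single-line chunks. */
    static List<List<Glyph>> groupIntoLines(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        List<List<Glyph>> chunks = new ArrayList<>();
        for (Glyph g : sorted) {
            List<Glyph> target = null;
            for (List<Glyph> chunk : chunks) {
                if (continues(chunk.get(chunk.size() - 1), g)) {
                    target = chunk;
                    break;
                }
            }
            if (target == null) {
                target = new ArrayList<>();
                chunks.add(target);
            }
            target.add(g);
        }
        return chunks;
    }

    /** True if next plausibly follows last on the same line of text. */
    static boolean continues(Glyph last, Glyph next) {
        double size = Math.max(last.fontSize, next.fontSize);
        double gap = next.x - (last.x + last.width);
        double sizeRatio = size / Math.min(last.fontSize, next.fontSize);
        return Math.abs(next.baseline - last.baseline) <= MAX_BASELINE_SHIFT * size
                && gap >= -MAX_OVERLAP * size
                && gap <= MAX_GAP * size
                && sizeRatio <= MAX_SIZE_RATIO;
    }

    /** Concatenates the characters of one chunk. */
    static String text(List<Glyph> chunk) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : chunk) {
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

Because the font size test rejects a much larger neighbour, a drop cap ends up
in a chunk of its own in this sketch, which is consistent with the weak spot
mentioned above.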
The next level tried to string these line segments into a text flow and assign
gross characteristics to the resulting paragraph-like chunks. These
characteristics include boilerplate, page number, footnote, title, side-bar,
caption, inset, section header and main text flow. The way that I did this was
to associate line segments with likely flow connections based on proximity and
to add hints based on the overall frequency of each font in the document and
on whether identical or similar text appeared on multiple pages. These
association features and global hints were enough to train a very simple model
that was very accurate at marking the characteristics of interest. The result
was a very clean text flow even in the presence of figures, multiple columns,
insets, titles and footnotes such as you typically find in technical
documents. Tables were not handled quite as well, but I think that they could
be done well.
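
The two global hints described above could be computed roughly as follows.
This is only a sketch under stated assumptions: the Segment type, the
normalization rule (digit runs collapsed to "#", so that "Page 1" and "Page 2"
count as similar text) and the method names are all hypothetical, and the
trained model itself is not shown.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names come from PDFBox.
final class SegmentHints {
    /** One single-line text chunk with its page number and font name. */
    static final class Segment {
        final int page;
        final String font;
        final String text;

        Segment(int page, String font, String text) {
            this.page = page;
            this.font = font;
            this.text = text;
        }
    }

    /** Fraction of all characters in the document that are set in each font. */
    static Map<String, Double> fontShare(List<Segment> segments) {
        Map<String, Double> share = new HashMap<>();
        double total = 0;
        for (Segment s : segments) {
            share.merge(s.font, (double) s.text.length(), Double::sum);
            total += s.text.length();
        }
        final double denominator = total;
        if (denominator > 0) {
            share.replaceAll((font, count) -> count / denominator);
        }
        return share;
    }

    /** Assumed similarity rule: ignore case, digit values and extra spaces. */
    static String normalize(String text) {
        return text.trim()
                .toLowerCase(Locale.ROOT)
                .replaceAll("\\d+", "#")
                .replaceAll("\\s+", " ");
    }

    /** Number of distinct pages on which each normalized text appears. */
    static Map<String, Integer> pagesPerText(List<Segment> segments) {
        Map<String, Set<Integer>> pages = new HashMap<>();
        for (Segment s : segments) {
            pages.computeIfAbsent(normalize(s.text), k -> new HashSet<>()).add(s.page);
        }
        Map<String, Integer> counts = new HashMap<>();
        pages.forEach((text, pageSet) -> counts.put(text, pageSet.size()));
        return counts;
    }
}
```

In this sketch a segment whose normalized text shows up on many pages would be
a candidate for boilerplate or a page number, and a segment in the font with
the largest share would be a candidate for main text flow. Both values are
meant as inputs to a classifier, not as decisions on their own.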
There is no way for me to redo this work right now, but it seems like a really
cool capability for PDFBox to have, and I would hate to see the effort I put in
be totally wasted if there is somebody who could use the head start that I
could provide.
I will re-subscribe to the dev mailing list for a short time in order to catch
responses to this offer, or people can contact me directly at my Apache mailing
address ([email protected]).
> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
> Key: PDFBOX-521
> URL: https://issues.apache.org/jira/browse/PDFBOX-521
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing
> Affects Versions: 0.8.0-incubator
> Environment: all
> Reporter: Mel Martinez
> Assignee: Andreas Lehmkühler
> Attachments: pdftextstripper2.zip, pdftextstripper_patch.txt
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is
> to ignore paragraph demarcation in the text. It basically just renders each
> line of text as it discovers it, separating each line equally with the same
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops
> in the extracted text. This is often necessary for text processing that
> needs to work with logical 'chunks' of text. Further, rendering into other
> formats (such as HTML or XML) is facilitated by resolving the document into
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete
> instrumentation of the parsing, allowing one to identify / tag paragraph
> starts and stops.