[jira] Commented: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Ted Dunning (JIRA) Thu, 02 Dec 2010 10:16:41 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966215#action_12966215
 ]


Ted Dunning commented on PDFBOX-521:
------------------------------------


I don't want to derail this commit, but I would love to do a brain dump with 
somebody about the (unreleasable) experiments that I did last year on text 
flow and paragraph detection across a variety of real documents.

The 30-second summary is that I grouped characters into single-line text chunks 
based on proximity and font size similarity.  This was relatively easy and gave 
high-quality lines except in the case of large paragraph-initial letters.  
Multi-column documents were handled correctly at this level.
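
To make the first step concrete, here is a minimal sketch of grouping glyphs 
into single-line chunks by proximity and font size similarity.  It is not the 
code from the experiments: the Glyph and LineChunker classes and all four 
thresholds are hypothetical, and the glyph positions are assumed to have been 
extracted already.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical glyph: text plus baseline origin, advance width and font size (assumed inputs). */
final class Glyph {
    final String text;
    final double x, y, width, fontSize;

    Glyph(String text, double x, double y, double width, double fontSize) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
        this.fontSize = fontSize;
    }
}

final class LineChunker {
    // Illustrative thresholds, as multiples of the smaller font size.
    // These values are assumptions, not taken from the original experiments.
    static final double MAX_GAP = 1.5;            // largest horizontal gap inside a chunk
    static final double MAX_OVERLAP = 0.5;        // tolerated backward overlap (kerning)
    static final double MAX_BASELINE_SHIFT = 0.3; // largest vertical offset inside a chunk
    static final double MAX_SIZE_RATIO = 1.2;     // largest font size ratio inside a chunk

    /** True if 'next' can directly follow 'last' on the same line chunk. */
    static boolean fits(Glyph last, Glyph next) {
        double small = Math.min(last.fontSize, next.fontSize);
        double large = Math.max(last.fontSize, next.fontSize);
        if (large > MAX_SIZE_RATIO * small) return false;
        if (Math.abs(next.y - last.y) > MAX_BASELINE_SHIFT * small) return false;
        double gap = next.x - (last.x + last.width);
        return gap <= MAX_GAP * small && gap >= -MAX_OVERLAP * small;
    }

    /** Greedily appends each glyph (left to right) to the first chunk whose last glyph it fits. */
    static List<List<Glyph>> chunk(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        List<List<Glyph>> chunks = new ArrayList<>();
        for (Glyph g : sorted) {
            List<Glyph> home = null;
            for (List<Glyph> c : chunks) {
                if (fits(c.get(c.size() - 1), g)) {
                    home = c;
                    break;
                }
            }
            if (home == null) {
                home = new ArrayList<>();
                chunks.add(home);
            }
            home.add(g);
        }
        return chunks;
    }

    /** Concatenates the glyph texts of one chunk. */
    static String text(List<Glyph> chunk) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : chunk) sb.append(g.text);
        return sb.toString();
    }
}
```

Because the column gutter is much wider than MAX_GAP, two columns that share a 
baseline end up in separate chunks.  A large initial letter fails the font size 
check against its neighbours, which is the failure case noted above.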

The next level tried to string these line segments into a text flow and assign 
gross characteristics to the resulting paragraph-like chunks.  These 
characteristics include boilerplate, page number, footnote, title, sidebar, 
caption, inset, section header and main text flow.  I did this by associating 
line segments with likely flow connections based on proximity, then adding 
hints based on the overall frequency of each font in the document and on 
whether identical or similar text appeared on multiple pages.  These 
associations and global features were enough to train a very simple model that 
was very accurate at marking the characteristics of interest.  The result was a 
very clean text flow even in the presence of figures, multiple columns, insets, 
titles and footnotes such as you typically find in technical documents.  Tables 
were not handled quite as well, but I think that they could be done well.
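
The two document-wide hints mentioned above (font frequency and text repeated 
across pages) could be computed roughly as follows.  This is only a sketch 
under assumptions: LineSegment and FlowHints are hypothetical names, the 
digit-masking in normalize() is one guess at what "similar text" might mean, 
and the hand-written label() rule is a stand-in for the trained model, whose 
details are not described here.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical line segment: page index, font name and extracted text. */
final class LineSegment {
    final int page;
    final String font;
    final String text;

    LineSegment(int page, String font, String text) {
        this.page = page;
        this.font = font;
        this.text = text;
    }
}

final class FlowHints {
    /** Fraction of all characters in the document that are set in each font. */
    static Map<String, Double> fontShare(List<LineSegment> segments) {
        Map<String, Double> share = new HashMap<>();
        double total = 0;
        for (LineSegment s : segments) {
            share.merge(s.font, (double) s.text.length(), Double::sum);
            total += s.text.length();
        }
        if (total == 0) return share;
        for (Map.Entry<String, Double> e : share.entrySet()) {
            e.setValue(e.getValue() / total);
        }
        return share;
    }

    /** Masks digits and collapses whitespace so "Page 1" and "Page 2" compare equal. */
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT)
                   .replaceAll("\\d+", "#")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    /** Number of distinct pages on which each normalized text occurs. */
    static Map<String, Integer> pageCount(List<LineSegment> segments) {
        Map<String, Set<Integer>> pages = new HashMap<>();
        for (LineSegment s : segments) {
            pages.computeIfAbsent(normalize(s.text), k -> new HashSet<>()).add(s.page);
        }
        Map<String, Integer> counts = new HashMap<>();
        pages.forEach((text, pageSet) -> counts.put(text, pageSet.size()));
        return counts;
    }

    /** Stand-in rule (an assumption, not the trained model) that turns the hints into a label. */
    static String label(LineSegment s, Map<String, Double> share, Map<String, Integer> counts) {
        if (counts.getOrDefault(normalize(s.text), 0) > 1) return "boilerplate";
        if (share.getOrDefault(s.font, 0.0) >= 0.5) return "main";
        return "other";
    }
}
```

In a real pipeline these values would be fed to a classifier as features, 
alongside the proximity-based flow links, rather than thresholded by hand.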

There is no way for me to redo this work right now, but it seems like a really 
cool capability for PDFBox to have and I would hate to see the effort I put in 
be totally wasted if there is somebody who could use the head-start that I 
could provide.

I will re-subscribe to the dev mailing list for a short time in order to catch 
responses to this offer, or people can contact me directly at my Apache mailing 
address ([email protected]).


> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
>                 Key: PDFBOX-521
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-521
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: all
>            Reporter: Mel Martinez
>            Assignee: Andreas Lehmkühler
>         Attachments: pdftextstripper2.zip, pdftextstripper_patch.txt
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is 
> to ignore paragraph demarcation in the text.  It basically just renders each 
> line of text as it discovers it, separating each line equally with the same 
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops 
> in the extracted text.  This is often necessary for text processing that 
> needs to work with logical 'chunks' of text.  Further, rendering into other 
> formats (such as HTML or XML) is facilitated by resolving the document into 
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete 
> instrumentation of the parsing, allowing one to identify / tag paragraph 
> starts and stops.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
