[jira] [Commented] (PDFBOX-2998) Enhance the text extraction capabilities

John Logan (JIRA) Fri, 04 Dec 2015 16:12:25 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15042460#comment-15042460
 ]


John Logan commented on PDFBOX-2998:
------------------------------------

Not sure whether this is a great place to put this comment, but I was having a 
look at the text extraction bugs and saw the general discussion.

One area that could be improved with a little effort is parameter selection 
for paragraph detection.  I put together a POC of this, as solving this 
problem helps me out a lot.

What I did was create an analyzer, based on the PDFTextStripper code, that 
stores the collection of drops and indents for a page, and then applies a crude 
heuristic to determine reasonable threshold values.  It appears to function 
pretty well for the test cases I have where the embedded default values are too 
low.
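
For illustration only, here is a minimal sketch of what such a threshold 
heuristic might look like.  The comment does not describe the actual 
heuristic, so the "widest gap" rule below is an assumption, and the class and 
method names (ThresholdEstimator, estimate) are hypothetical and not part of 
PDFBox.  It takes the drops (or indents) collected for a page and places the 
threshold in the middle of the largest jump between sorted values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper, not part of PDFBox. */
final class ThresholdEstimator {
    private ThresholdEstimator() {}

    /**
     * Assumed heuristic: sort the samples and return the midpoint of the
     * widest gap between two consecutive values. Returns the fallback when
     * there are fewer than two samples or when all samples are equal.
     */
    static float estimate(List<Float> samples, float fallback) {
        if (samples == null || samples.size() < 2) {
            return fallback;
        }
        List<Float> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);

        float widestGap = 0f;
        float threshold = fallback;
        for (int i = 1; i < sorted.size(); i++) {
            float lower = sorted.get(i - 1);
            float upper = sorted.get(i);
            float gap = upper - lower;
            if (gap > widestGap) {
                widestGap = gap;
                // Split ordinary line spacing from paragraph-sized spacing.
                threshold = (lower + upper) / 2f;
            }
        }
        return threshold;
    }
}
```

A value computed this way could then be passed to 
PDFTextStripper.setDropThreshold or PDFTextStripper.setIndentThreshold 
before extraction, with the stripper's current default used as the fallback.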

From an implementation standpoint the solution is wanting because it's not 
very DRY.  I originally implemented the solution directly in PDFTextStripper 
using a two pass scan, but that makes an already complicated method even more 
so.

> Enhance the text extraction capabilities
> ----------------------------------------
>
>                 Key: PDFBOX-2998
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2998
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Meier
>         Attachments: DropCapExample1.pdf, DropCapExample2.pdf, 
> DropCapExample3.pdf, DropCapExample4.pdf, DropCapExample5.pdf, 
> DropCapSegmentation.jpg, TextBehindText.pdf
>
>
> PDFBox will need some enhancement to the current text extraction to extract 
> text correctly.
> At the moment the text of a document is extracted using the position of 
> single characters.
> This may lead to wrong results, due to the format of the file.
> There are good tools such as  https://code.google.com/p/lapdftext which we 
> could use to compare our current output.
> Possible enhancements are
> - enhance matching of text to a certain line i.e. don't mix up text from 
> different lines
> - better handling of rotated text
> - handling of vertical text
> - ability to get additional text properties such as font, font size ...
> Some of these are already logged as individual tickets



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

