[jira] [Commented] (PDFBOX-2998) Enhance the text extraction capabilities

Timo Boehme (JIRA) Tue, 06 Oct 2015 07:57:18 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945139#comment-14945139
 ]


Timo Boehme commented on PDFBOX-2998:
-------------------------------------

The problem is that once the character spacing is set to unusual values, even a 
{{(oW)}} chunk is not one word; each character belongs to a separate word 
(another example: 'not a word' drawn with (taw) in one chunk). So while in most 
cases applications group words correctly, there is unfortunately a not too 
small number of such 'misuses', and they are not easy to detect. The most 
general solution is therefore to first split everything into single characters 
with correct positions and then group them with the same algorithm, 
independent of how the text chunks were provided.
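
A minimal sketch of that idea in Java, not tied to the PDFBox API: the 
{{Glyph}} class, the {{groupIntoWords}} method and the gap factor are 
hypothetical names and values chosen for illustration. It assumes the glyphs 
of one line already carry resolved positions, and it starts a new word 
whenever the gap to the previous glyph exceeds a fraction of the glyph width.

```java
import java.util.ArrayList;
import java.util.List;

class GlyphGrouper {

    /** Hypothetical glyph: one character with its resolved position on a line. */
    static final class Glyph {
        final char ch;
        final float x;      // left edge of the glyph
        final float width;  // glyph width, excluding any character spacing

        Glyph(char ch, float x, float width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Groups glyphs of a single line into words by position only.
     * The chunk a glyph came from is deliberately ignored.
     * gapFactor is an assumed tuning value, e.g. 0.3f.
     */
    static List<String> groupIntoWords(List<Glyph> glyphs, float gapFactor) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort((Glyph a, Glyph b) -> Float.compare(a.x, b.x));

        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : sorted) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                float limit = gapFactor * Math.max(prev.width, g.width);
                if (gap > limit) {
                    // Gap is too wide: close the current word.
                    words.add(current.toString());
                    current.setLength(0);
                }
            }
            current.append(g.ch);
            prev = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

With this rule, an 'o' and a 'W' from the same chunk end up in separate words 
when a large character spacing pushes them apart, and in one word when they 
sit next to each other.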

Text direction should be respected in every case. Font (name/size) might work 
as a grouping criterion in most cases, but there are cases with mixed 
fonts/sizes within the same 'word' (chemical names using Greek characters etc.).
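
As a rough illustration of these two criteria, the following Java sketch 
treats a change of text direction as a hard word break and ignores the font 
completely. All names ({{WordBreakRule}}, {{Glyph}}, {{startsNewWord}}) and 
the gap factor are hypothetical and not part of PDFBox.

```java
class WordBreakRule {

    /** Hypothetical glyph with font and direction information. */
    static final class Glyph {
        final String text;
        final String fontName;      // kept for reference, not used for breaking
        final int directionDegrees; // text direction, e.g. 0, 90, 180, 270
        final float start;          // start position along the text direction
        final float width;          // glyph width along the text direction

        Glyph(String text, String fontName, int directionDegrees,
              float start, float width) {
            this.text = text;
            this.fontName = fontName;
            this.directionDegrees = directionDegrees;
            this.start = start;
            this.width = width;
        }
    }

    /**
     * Decides whether 'next' starts a new word after 'prev'.
     * A direction change always breaks; a font change never does on its own.
     */
    static boolean startsNewWord(Glyph prev, Glyph next, float gapFactor) {
        if (prev.directionDegrees != next.directionDegrees) {
            return true;
        }
        float gap = next.start - (prev.start + prev.width);
        return gap > gapFactor * Math.max(prev.width, next.width);
    }
}
```

A Greek letter set in a symbol font followed directly by a hyphen in the body 
font stays in the same word under this rule, while a rotated glyph at the 
same position starts a new one.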

> Enhance the text extraction capabilities
> ----------------------------------------
>
>                 Key: PDFBOX-2998
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2998
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Meier
>         Attachments: TextBehindText.pdf
>
>
> PDFBox will need some -document layout analysis tools- enhancement to the 
> current text extraction to extract text correctly.
> At the moment the text of a document is extracted using the position of 
> single characters.
> This may lead to wrong results, due to the format of the file.
> There are good tools such as https://code.google.com/p/lapdftext which we 
> could use to compare our current output.
> Possible enhancements are
> - enhance matching of text to a certain line i.e. don't mix up text from 
> different lines
> - better handling of rotated text
> - handling of vertical text
> - ability to get additional text properties such as font, font size ...
> Some of these are already logged as individual tickets


