[jira] [Commented] (PDFBOX-2998) Enhance the text extraction capabilities

Maruan Sahyoun (JIRA) Wed, 07 Oct 2015 00:34:51 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946445#comment-14946445
 ]


Maruan Sahyoun commented on PDFBOX-2998:
----------------------------------------

I'm not the best person to answer this question, as text extraction is not my 
domain. My understanding is that, for the reasons outlined above, characters 
are currently handled individually and separate logic then reassembles them 
into words. If we instead kept together the text drawn by a single text-showing 
operator, would that improve text extraction (bearing in mind that a single 
operator might show an individual character, part of a word, parts of 
different words, or multiple words)?

It would be good to select some documents where text extraction could be 
improved and look at them conceptually. [~AndreasMeier], would you have some 
you could share?

Technically, we can already hook into showText() instead of showGlyph() to deal 
with chunks instead of individual characters.
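
To make the trade-off concrete, here is a minimal sketch in plain Java. It is a 
toy model only and does not use PDFBox: the Glyph class, the sample data and 
all method names are hypothetical. It assumes three text-showing operators 
that draw "Hel", "lo wor" (with a positional gap rather than a space glyph) 
and "ld", then compares per-glyph joining by position with keeping each 
operator's text together.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: Glyph and the methods below are hypothetical, not PDFBox API.
class ChunkVersusGlyph {

    /** One glyph: its text, the x position where it starts, and its advance width. */
    static final class Glyph {
        final String unicode;
        final double x;
        final double width;

        Glyph(String unicode, double x, double width) {
            this.unicode = unicode;
            this.x = x;
            this.width = width;
        }
    }

    /** Lays out the characters of text side by side, starting at startX. */
    static List<Glyph> run(String text, double startX, double width) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            glyphs.add(new Glyph(String.valueOf(text.charAt(i)), startX + i * width, width));
        }
        return glyphs;
    }

    /** Assumed sample: three operators that do not line up with word boundaries. */
    static List<List<Glyph>> sample() {
        List<Glyph> second = new ArrayList<>(run("lo", 15, 5));
        second.addAll(run("wor", 30, 5)); // 5-unit gap before "w", no space glyph
        return Arrays.asList(run("Hel", 0, 5), second, run("ld", 45, 5));
    }

    /** Drops the operator grouping and returns all glyphs in drawing order. */
    static List<Glyph> flatten(List<List<Glyph>> operators) {
        List<Glyph> all = new ArrayList<>();
        for (List<Glyph> op : operators) {
            all.addAll(op);
        }
        return all;
    }

    /** Per-glyph approach: insert a space wherever the horizontal gap exceeds the threshold. */
    static String joinByPosition(List<Glyph> glyphs, double gapThreshold) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null && g.x - (prev.x + prev.width) > gapThreshold) {
                sb.append(' ');
            }
            sb.append(g.unicode);
            prev = g;
        }
        return sb.toString();
    }

    /** Per-operator approach: keep the text of each operator together as one chunk. */
    static List<String> chunksByOperator(List<List<Glyph>> operators) {
        List<String> chunks = new ArrayList<>();
        for (List<Glyph> op : operators) {
            StringBuilder sb = new StringBuilder();
            for (Glyph g : op) {
                sb.append(g.unicode);
            }
            chunks.add(sb.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<List<Glyph>> operators = sample();
        System.out.println(joinByPosition(flatten(operators), 2.0)); // Hello world
        System.out.println(chunksByOperator(operators));             // [Hel, lowor, ld]
    }
}
```

In this made-up case the operator chunks alone do not give word boundaries, so 
some position-based logic would probably still be needed on top of them.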

> Enhance the text extraction capabilities
> ----------------------------------------
>
>                 Key: PDFBOX-2998
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2998
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Meier
>         Attachments: TextBehindText.pdf
>
>
> PDFBox will need some -document layout analysis tools- enhancement to the 
> current text extraction to extract text correctly.
> At the moment the text of a document is extracted using the position of 
> single characters.
> This may lead to wrong results, due to the format of the file.
> There are good tools such as https://code.google.com/p/lapdftext which we 
> could use to compare our current output.
> Possible enhancements are
> - enhance matching of text to a certain line i.e. don't mix up text from 
> different lines
> - better handling of rotated text
> - handling of vertical text
> - ability to get additional text properties such as font, font size ...
> Some of these are already logged as individual tickets



