[ https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946324#comment-14946324 ]
Andreas Meier commented on PDFBOX-2998:
---------------------------------------

I just wanted to fuel the discussion with my snippet. My intention is not to provide code that breaks an already great extraction engine ;)

{quote}
I'd even start a step before that
{quote}

That depends on what is possible at the lower levels...

I don't know if I am the right person to take part in that discussion any further, but I will try to give the "simple view" of the problem from a higher level:

- Might it be useful to hold information like "(Hello World)" in a (meta-)information store, so that PDFBox can later take the single characters and form the word again? (No font type or size needed, just simple character matching based on position and rotation...)
- Would it make sense to check the font type and size, and just handle cases like chemical names? ([~tboehme], do you know of any other reasons for a different font or size within a word?)

> Enhance the text extraction capabilities
> ----------------------------------------
>
>                 Key: PDFBOX-2998
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2998
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Meier
>         Attachments: TextBehindText.pdf
>
>
> PDFBox will need some -document layout analysis tools- enhancement to the
> current text extraction to extract text correctly.
> At the moment the text of a document is extracted using the position of
> single characters.
> This may lead to wrong results, due to the format of the file.
> There are good tools such as https://code.google.com/p/lapdftext which we
> could use to compare our current output.
> Possible enhancements are
> - enhance matching of text to a certain line, i.e. don't mix up text from
> different lines
> - better handling of rotated text
> - handling of vertical text
> - ability to get additional text properties such as font, font size ...
> Some of these are already logged as individual tickets
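
The "simple character matching based on position and rotation" idea from the comment could be sketched roughly as follows. This is only a standalone illustration and not PDFBox code: `WordGrouper`, `Glyph`, `groupIntoWords`, `maxGap` and `baselineTolerance` are hypothetical names. The sketch assumes the glyphs arrive in content-stream order, a y-up coordinate system, and rotation given in degrees. In PDFBox the equivalent data would presumably come from `TextPosition` objects.

```java
import java.util.ArrayList;
import java.util.List;

final class WordGrouper {

    /** Hypothetical glyph holder: text, origin, advance width and rotation in degrees. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;
        final double width;
        final double rotation;

        Glyph(String text, double x, double y, double width, double rotation) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.rotation = rotation;
        }
    }

    /** Merges consecutive glyphs into words; font type and size are deliberately ignored. */
    static List<String> groupIntoWords(List<Glyph> glyphs, double maxGap, double baselineTolerance) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph glyph : glyphs) {
            // Close the current word as soon as a glyph does not continue it.
            if (previous != null && !continuesWord(previous, glyph, maxGap, baselineTolerance)) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(glyph.text);
            previous = glyph;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    /** True if 'next' has the same rotation and sits right after 'previous' on its baseline. */
    private static boolean continuesWord(Glyph previous, Glyph next,
                                         double maxGap, double baselineTolerance) {
        // Assumption: rotations are normalised; wrap-around (359.9 vs 0) is not handled.
        if (Math.abs(previous.rotation - next.rotation) > 0.5) {
            return false;
        }
        double radians = Math.toRadians(previous.rotation);
        double cos = Math.cos(radians);
        double sin = Math.sin(radians);
        double dx = next.x - previous.x;
        double dy = next.y - previous.y;
        double along = dx * cos + dy * sin;    // offset in the writing direction
        double across = -dx * sin + dy * cos;  // offset perpendicular to the baseline
        double gap = along - previous.width;   // space between the two glyphs
        return Math.abs(across) <= baselineTolerance && Math.abs(gap) <= maxGap;
    }
}
```

Calling `WordGrouper.groupIntoWords(glyphs, 1.0, 0.5)` on the glyphs of a single text-showing operation would then give back the words without looking at font or size. Cases like chemical names, where a subscript shifts the baseline, would need a larger `baselineTolerance` or a separate rule.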