[
https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946604#comment-14946604
]
Andreas Meier edited comment on PDFBOX-2998 at 10/7/15 9:58 AM:
----------------------------------------------------------------
Thanks for pointing that out [~tboehme].
Unfortunately, I have no files that show some of the characteristics we are discussing.
I created some drop cap PDF files that may be used for further investigation
of that case, but at the moment I see no way to handle any of these problems.
Even with additional programs doing a segmentation like the one in the JPEG I
posted, it won't be easy.
> Enhance the text extraction capabilities
> ----------------------------------------
>
> Key: PDFBOX-2998
> URL: https://issues.apache.org/jira/browse/PDFBOX-2998
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Andreas Meier
> Attachments: DropCapExample1.pdf, DropCapExample2.pdf,
> DropCapExample3.pdf, DropCapExample4.pdf, DropCapExample5.pdf,
> DropCapSegmentation.jpg, TextBehindText.pdf
>
>
> PDFBox needs some enhancements to the
> current text extraction in order to extract text correctly.
> At the moment the text of a document is extracted using the positions of
> single characters.
> This may lead to wrong results, depending on the layout of the file.
> There are good tools such as https://code.google.com/p/lapdftext which we
> could use to compare our current output.
> Possible enhancements are
> - better matching of text to a given line, i.e. don't mix up text from
> different lines
> - better handling of rotated text
> - handling of vertical text
> - ability to get additional text properties such as font, font size ...
> Some of these are already logged as individual tickets
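
To make the "matching of text to a certain line" point concrete, here is a minimal sketch of position-based line grouping. It is not PDFBox code: the Glyph class, the groupIntoLines method and the tolerance factor are hypothetical names made up for illustration, and the sketch assumes y grows downward and that every glyph carries its baseline position and font size. It clusters glyphs whose baselines lie within a font-size-relative tolerance and only then orders each cluster from left to right.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LineGrouper {

    /** Hypothetical glyph: one character, its start x, its baseline y and its font size. */
    public static final class Glyph {
        final String text;
        final double x;
        final double baselineY;
        final double fontSize;

        public Glyph(String text, double x, double baselineY, double fontSize) {
            this.text = text;
            this.x = x;
            this.baselineY = baselineY;
            this.fontSize = fontSize;
        }
    }

    /**
     * Groups glyphs into lines by baseline, then orders each line by x.
     * toleranceFactor is an assumed tuning knob: two glyphs share a line when
     * their baselines differ by at most toleranceFactor * the smaller font size.
     */
    public static List<String> groupIntoLines(List<Glyph> glyphs, double toleranceFactor) {
        // Sort a copy top to bottom (assumption: y grows downward).
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.baselineY));

        List<List<Glyph>> lines = new ArrayList<>();
        List<Glyph> current = new ArrayList<>();
        for (Glyph g : sorted) {
            if (!current.isEmpty()) {
                // Compare against the first glyph of the current line.
                Glyph ref = current.get(0);
                double tolerance = toleranceFactor * Math.min(ref.fontSize, g.fontSize);
                if (Math.abs(g.baselineY - ref.baselineY) > tolerance) {
                    lines.add(current);
                    current = new ArrayList<>();
                }
            }
            current.add(g);
        }
        if (!current.isEmpty()) {
            lines.add(current);
        }

        // Within each line, order left to right and concatenate the characters.
        List<String> result = new ArrayList<>();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder sb = new StringBuilder();
            for (Glyph g : line) {
                sb.append(g.text);
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

A sketch like this would probably still go wrong on the drop cap examples attached here, since a drop cap's baseline usually sits several lines below the line it belongs to; handling that would need some kind of segmentation step first. Rotated and vertical text are not considered at all.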