[ https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934792#comment-14934792 ]
Maruan Sahyoun commented on PDFBOX-2998:
----------------------------------------
I'd rather keep that open - or find another way of discussing that topic.
I agree that writing a complete layout analysis tool is complex and may or may
not be something that should reside in PDFBox. OTOH I think it's good to
consider such tools and research in order to think about what might be useful.
We have text extraction, and it needs enhancements. Maybe a better line-finding
algorithm. Maybe a clearer decision on whether it's preferable to keep words (or
parts of them) as they are defined in the PDF, or to handle each character
individually as is currently done. Maybe better detection of vertical and
rotated text. Given the experience you have, I think you can provide a lot of
input (and maybe code) to enhance the current text extraction. I'm sure you can
outline some doable tasks that others can work on.
To facilitate that, we can change the title of the issue so that it does not
give the wrong impression.
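To make the "better line finding" idea a bit more concrete, here is a rough
sketch of one possible approach: group single characters into lines by
baseline position, then order each line from left to right. It is only an
illustration under assumptions. The LineFinder and Glyph names are
hypothetical and not part of PDFBox, the tolerance is an arbitrary parameter,
and the sketch assumes horizontal text in PDF user space (larger y is higher
on the page). It ignores word spacing, columns, and rotated text.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper, not a PDFBox class. */
public class LineFinder {

    /** Hypothetical record of one character and its baseline origin. */
    public static final class Glyph {
        final String text;
        final float x;
        final float y;

        public Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups glyphs into text lines. A glyph joins the current line if its
     * baseline is within the tolerance of the first glyph of that line.
     * Lines are returned top to bottom.
     */
    public static List<String> findLines(List<Glyph> glyphs, float tolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        // Top of the page first: sort by y in descending order.
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y).reversed());

        List<String> lines = new ArrayList<>();
        List<Glyph> current = new ArrayList<>();
        for (Glyph g : sorted) {
            if (!current.isEmpty()
                    && Math.abs(current.get(0).y - g.y) > tolerance) {
                lines.add(join(current));
                current = new ArrayList<>();
            }
            current.add(g);
        }
        if (!current.isEmpty()) {
            lines.add(join(current));
        }
        return lines;
    }

    /** Orders one line left to right and concatenates its characters. */
    private static String join(List<Glyph> line) {
        line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : line) {
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

Calling LineFinder.findLines(glyphs, 2.0f) with glyphs whose baselines differ
by less than two units puts them on the same line. A real implementation
would probably derive the tolerance from the font size instead of using a
fixed value.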
> Document layout analysis tools needed
> -------------------------------------
>
> Key: PDFBOX-2998
> URL: https://issues.apache.org/jira/browse/PDFBOX-2998
> Project: PDFBox
> Issue Type: New Feature
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Andreas Meier
>
> PDFBox will need some document layout analysis tools to extract text
> correctly.
> At the moment the text of a document is extracted using the position of
> single characters.
> This may lead to wrong results, due to the format of the file.
> For a good extraction, layout analysis and segmentation have to be done in a
> previous step.
> https://code.google.com/p/lapdftext
> Would be a good solution for a layout analysis tool, unfortunately, it
> heavily relies on other libraries and needs Java 1.6 to run.
> The layout analysis tool should segment the file and return a list or set
> of rectangles.