[jira] [Commented] (PDFBOX-2998) Enhance the text extraction capabilities

Maruan Sahyoun (JIRA) Thu, 01 Oct 2015 06:04:51 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939783#comment-14939783 ]


Maruan Sahyoun commented on PDFBOX-2998:
----------------------------------------

We can look at this from two angles: the 'block' structure of a document, or 
the individual characters. If you already know the 'blocks' you are interested 
in, you can use PDFTextStripperByArea today: 
http://pdfbox.apache.org/docs/2.0.0-SNAPSHOT/javadocs/org/apache/pdfbox/text/PDFTextStripperByArea.html
So it should be possible to leverage products that can already generate those 
blocks for you, or custom solutions.
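
To illustrate the 'known blocks' angle, here is a minimal, library-free sketch 
of the idea: keep only the characters whose position falls inside a given 
rectangle. In PDFBox 2.0 the corresponding calls on PDFTextStripperByArea are 
addRegion, extractRegions and getTextForRegion; the Glyph class and the 
textInRegion method below are hypothetical names used only for this 
illustration and are not part of PDFBox.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

class RegionTextSketch {

    /** Hypothetical record: one character plus its position on the page. */
    static final class Glyph {
        final char ch;
        final double x;
        final double y;

        Glyph(char ch, double x, double y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /** Concatenates, in input order, the glyphs located inside the region. */
    static String textInRegion(List<Glyph> glyphs, Rectangle2D region) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            if (region.contains(g.x, g.y)) {
                sb.append(g.ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> page = Arrays.asList(
                new Glyph('A', 10, 10),
                new Glyph('B', 20, 10),
                new Glyph('C', 200, 300));
        // Only the glyphs inside the 100 x 50 rectangle at the origin are kept.
        System.out.println(textInRegion(page, new Rectangle2D.Double(0, 0, 100, 50)));
    }
}
```

The sketch assumes the block coordinates come from somewhere else (a layout 
tool or a custom solution), which is exactly the division of labour described 
above.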

OTOH, looking at the current output, there are some issues with the lower-level 
objects, e.g. words are intermixed in some cases even if they belong to the same 
block. I think it is important to enhance PDFBox so that the lower-level 
handling is improved. As an example: why does our current code look only at 
characters even when there is a 'word' in the PDF? To be able to look at the 
block level (or paragraphs, or lines, or ...), the lower-level information must 
be correct.
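
To make the dependency on lower-level information concrete, here is a rough 
sketch of one way to group positioned characters into lines and then split 
them into words. It is not the algorithm PDFBox uses; the Glyph class, the 
toLines method and both thresholds are assumptions made for this illustration. 
It also assumes horizontal, left-to-right text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class LineGroupingSketch {

    /** Hypothetical record: a piece of text with its position and width. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;
        final double width;

        Glyph(String text, double x, double y, double width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Groups glyphs into lines by their y coordinate, then orders each line
     * by x and inserts a space wherever the horizontal gap between two
     * neighbouring glyphs exceeds gapThreshold.
     */
    static List<String> toLines(List<Glyph> glyphs, double yTolerance, double gapThreshold) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        // A glyph joins the current line if its y is within yTolerance of
        // the first glyph of that line; otherwise it starts a new line.
        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : sorted) {
            List<Glyph> current = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (current == null || Math.abs(g.y - current.get(0).y) > yTolerance) {
                current = new ArrayList<>();
                lines.add(current);
            }
            current.add(g);
        }

        List<String> result = new ArrayList<>();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder sb = new StringBuilder();
            Glyph prev = null;
            for (Glyph g : line) {
                if (prev != null && g.x - (prev.x + prev.width) > gapThreshold) {
                    sb.append(' ');
                }
                sb.append(g.text);
                prev = g;
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("c", 15, 100, 5),
                new Glyph("d", 0, 120, 5),
                new Glyph("a", 0, 100, 5),
                new Glyph("b", 5, 100.4, 5));
        System.out.println(toLines(glyphs, 1.0, 2.0));
    }
}
```

If the positions or widths fed into such a step are slightly off, characters 
end up on the wrong line or words get merged or split, and any block or 
paragraph analysis built on top inherits those errors.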

A little disclaimer here: I wasn't involved in the text extraction code before 
looking at PDFBOX-2252 (which I was only able to handle because of your 
initiative). So some of these observations might not be relevant.

So I think the bottom line I take from the comments made so far is that 
higher-level analysis is domain dependent, and PDFBox should concentrate on 
providing the information that makes such solutions easier to build (on top of 
and outside of PDFBox). If that is the consensus, let's concentrate on that.



> Enhance the text extraction capabilities
> ----------------------------------------
>
>                 Key: PDFBOX-2998
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2998
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Meier
>
> PDFBox will need some -document layout analysis tools- enhancement to the 
> current text extraction to extract text correctly.
> At the moment the text of a document is extracted using the position of 
> single characters.
> This may lead to wrong results, depending on the format of the file.
> There are good tools such as https://code.google.com/p/lapdftext which we 
> could use to compare our current output.
> Possible enhancements are
> - enhance matching of text to a certain line i.e. don't mix up text from 
> different lines
> - better handling of rotated text
> - handling of vertical text
> - ability to get additional text properties such as font, font size ...
> Some of these are already logged as individual tickets



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

