[ 
https://issues.apache.org/jira/browse/PDFBOX-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255190#comment-14255190
 ] 

ASF subversion and git services commented on PDFBOX-1874:
---------------------------------------------------------

Commit 1647158 from [~lehmi] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1647158 ]

PDFBOX-1874: adjust precision to avoid false results when comparing floats as 
proposed by Yuri Burrows

> PDFTextStripper.isParagraphSeparation(...)
> ------------------------------------------
>
>                 Key: PDFBOX-1874
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1874
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.3
>         Environment: Eclipse
>            Reporter: Yuri Burrows
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>              Labels: patch
>
> PDFTextStripper.isParagraphSeparation(...) seems to have an issue with how it 
> finds Y text indentation.
> PROBLEM:
> I believe the issue is due to precision in the following logic:
> {code}
>             float yGap = Math.abs(position.getTextPosition().getYDirAdj()-
>                     lastPosition.getTextPosition().getYDirAdj());
>             float xGap = (position.getTextPosition().getXDirAdj()-
>                     lastLineStartPosition.getTextPosition().getXDirAdj());
>             if(yGap > (getDropThreshold()*maxHeightForLine))
>             {
>                         result = true;
> {code}
> yGap has a precision to 1000th+, while (getDropThreshold()*maxHeightForLine) 
> has a precision to 100,000th, resulting in the following comparison (example):
> 16.018 > 16.018005
> which evaluates to "True". However 16.018 > 16.018 would evaluate to "False".
> EFFECT OF THE PROBLEM:
> every line in the output is marked as "isParagraphStart = true" and 
> "writeParagraphEnd() ... = true".
> I.E. 
> |||NEW_LINE|||
> |||PARAGRAPH_START|||PDFBox has been designed to represent PDF documents 
> using familiar object-oriented paradigms. The data|||NEW_LINE|||
> contained in a PDF document is a collection of basic object types: arrays, 
> booleans, dictionaries, numbers,|||NEW_LINE|||
> |||PARAGRAPH_END||||||NEW_LINE|||
> |||PARAGRAPH_START|||strings and binary streams. PDFBox captures these basic 
> object types in the org.pdfbox.cos package (the|||NEW_LINE|||
> COS Model). While it's possible to create any desired interactions with a PDF 
> document using only these|||NEW_LINE|||
> |||PARAGRAPH_END||||||NEW_LINE|||
> In the source PDF these lines appear as such:
> "PDFBox has been designed to represent PDF documents using familiar 
> object-oriented paradigms. The data
> contained in a PDF document is a collection of basic object types: arrays, 
> booleans, dictionaries, numbers,
> strings and binary streams. PDFBox captures these basic object types in the 
> org.pdfbox.cos package (the
> COS Model). While it's possible to create any desired interactions with a PDF 
> document using only these"
> MY WORKAROUND:
> NOTE: there is a small performance hit with this workaround.
> {code}
>        float yGap = Math.abs(position.getTextPosition().getYDirAdj()
>        - lastPosition.getTextPosition().getYDirAdj());
>       
>        DecimalFormat df = new DecimalFormat("#.00");
>        float yGapTruncated = Float.valueOf(df.format(yGap));
>       
>        float newYVal = Float.valueOf(df.format(getDropThreshold()
>        * maxHeightForLine));
> {code}
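
For illustration only, here is a minimal sketch of two common ways to keep float noise from flipping a threshold comparison like the one quoted above. It is not taken from commit r1647158 and may differ from the actual fix. The class name GapComparison, the helper names exceeds and roundTo, and the EPSILON value are all hypothetical; the sample values are loosely based on the numbers in the report.

```java
final class GapComparison {
    // Hypothetical tolerance, chosen for illustration; not a PDFBox constant.
    private static final float EPSILON = 1e-4f;

    /** True only when gap exceeds threshold by more than the tolerance. */
    static boolean exceeds(float gap, float threshold) {
        return gap - threshold > EPSILON;
    }

    /** Rounds a value to the given number of decimals using plain arithmetic. */
    static float roundTo(float value, int decimals) {
        float scale = (float) Math.pow(10, decimals);
        return Math.round(value * scale) / scale;
    }

    public static void main(String[] args) {
        // Two values that differ only in the sixth decimal place.
        float yGap = 16.018005f;
        float threshold = 16.018f;

        // Raw comparison: the tiny difference decides the result.
        System.out.println(yGap > threshold);
        // Tolerance-based comparison: differences below EPSILON are ignored.
        System.out.println(exceeds(yGap, threshold));
        // Comparison after rounding both sides to two decimals.
        System.out.println(roundTo(yGap, 2) > roundTo(threshold, 2));
    }
}
```

Both variants avoid the String round trip through DecimalFormat used in the workaround above. That round trip depends on the default locale's decimal separator, so Float.valueOf can fail to parse the formatted value in locales that use a comma.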



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
