[ https://issues.apache.org/jira/browse/PDFBOX-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255190#comment-14255190 ]
ASF subversion and git services commented on PDFBOX-1874:
---------------------------------------------------------

Commit 1647158 from [~lehmi] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1647158 ]

PDFBOX-1874: adjust precision to avoid false results when comparing floats as proposed by Yuri Burrows

> PDFTextStripper.isParagraphSeparation(...)
> ------------------------------------------
>
>                 Key: PDFBOX-1874
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1874
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.3
>         Environment: Eclipse
>            Reporter: Yuri Burrows
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>              Labels: patch
>
> PDFTextStripper.isParagraphSeparation(...) seems to have an issue with how it
> finds Y text indentation.
>
> PROBLEM:
> I believe the issue is due to precision in the following logic:
> {code}
> float yGap = Math.abs(position.getTextPosition().getYDirAdj()-
>         lastPosition.getTextPosition().getYDirAdj());
> float xGap = (position.getTextPosition().getXDirAdj()-
>         lastLineStartPosition.getTextPosition().getXDirAdj());
> if(yGap > (getDropThreshold()*maxHeightForLine))
> {
>     result = true;
> {code}
> yGap has precision to the thousandths place or finer, while
> (getDropThreshold()*maxHeightForLine) has precision to the
> hundred-thousandths place. This results in the following comparison (example):
> 16.018 > 16.018005
> which evaluates to "True". However, 16.018 > 16.018 would evaluate to "False".
>
> EFFECT OF THE PROBLEM:
> Every line in the output is marked as "isParagraphStart = true" and
> "writeParagraphEnd() ... = true".
> For example:
> |||NEW_LINE|||
> |||PARAGRAPH_START|||PDFBox has been designed to represent PDF documents using familiar object-oriented paradigms. The data|||NEW_LINE|||
> contained in a PDF document is a collection of basic object types: arrays, booleans, dictionaries, numbers,|||NEW_LINE|||
> |||PARAGRAPH_END||||||NEW_LINE|||
> |||PARAGRAPH_START|||strings and binary streams. PDFBox captures these basic object types in the org.pdfbox.cos package (the|||NEW_LINE|||
> COS Model). While it's possible to create any desired interactions with a PDF document using only these|||NEW_LINE|||
> |||PARAGRAPH_END||||||NEW_LINE|||
>
> In the source PDF these lines appear as such:
> "PDFBox has been designed to represent PDF documents using familiar object-oriented paradigms. The data
> contained in a PDF document is a collection of basic object types: arrays, booleans, dictionaries, numbers,
> strings and binary streams. PDFBox captures these basic object types in the org.pdfbox.cos package (the
> COS Model). While it's possible to create any desired interactions with a PDF document using only these"
>
> MY WORKAROUND:
> NOTE: there is a small performance hit with this workaround.
> {code}
> float yGap = Math.abs(position.getTextPosition().getYDirAdj()
>         - lastPosition.getTextPosition().getYDirAdj());
>
> DecimalFormat df = new DecimalFormat("#.00");
> float yGapTruncated = Float.valueOf(df.format(yGap));
>
> float newYVal = Float.valueOf(df.format(getDropThreshold()
>         * maxHeightForLine));
> {code}
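
For illustration only, here is a minimal sketch of the same idea using arithmetic rounding instead of a DecimalFormat round trip through a String. The class name GapComparison and the helpers roundTo3 and exceedsThreshold are hypothetical and are not taken from the PDFBox sources or from the commit above. The sample values are borrowed from the report and arranged so that the raw float comparison is true.

```java
class GapComparison {
    // Hypothetical helper: round a float to three decimal places
    // without formatting it as a String first.
    public static float roundTo3(float value) {
        return Math.round(value * 1000f) / 1000f;
    }

    // Hypothetical check modelled on the yGap > threshold test quoted in the
    // report; both sides are rounded to the same precision before comparing.
    public static boolean exceedsThreshold(float yGap, float dropThreshold,
                                           float maxHeightForLine) {
        return roundTo3(yGap) > roundTo3(dropThreshold * maxHeightForLine);
    }

    public static void main(String[] args) {
        float yGap = 16.018005f;   // sample value from the report
        float threshold = 16.018f; // sample value from the report

        // The raw comparison sees two distinct float values.
        System.out.println(yGap > threshold);
        // After rounding both sides, the values compare as equal.
        System.out.println(roundTo3(yGap) > roundTo3(threshold));
    }
}
```

Rounding both operands to the same number of decimal places keeps noise in the low-order digits from tipping the comparison, and it avoids the locale-dependent String formatting that the DecimalFormat workaround relies on. Whether three decimal places is the right precision for text positions is an assumption of this sketch.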