Yuri Burrows created PDFBOX-1874:
------------------------------------

             Summary: PDFTextStripper.isParagraphSeparation(...)
                 Key: PDFBOX-1874
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1874
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.3
         Environment: Eclipse
            Reporter: Yuri Burrows
            Priority: Minor


PDFTextStripper.isParagraphSeparation(...) seems to have an issue with how it 
evaluates the vertical (Y) gap between lines.

PROBLEM:
I believe the issue is due to floating-point precision in the following logic:
            float yGap = Math.abs(position.getTextPosition().getYDirAdj()-
                    lastPosition.getTextPosition().getYDirAdj());
            float xGap = (position.getTextPosition().getXDirAdj()-
                    lastLineStartPosition.getTextPosition().getXDirAdj());

            if(yGap > (getDropThreshold()*maxHeightForLine))
            {
                        result = true;

yGap and (getDropThreshold()*maxHeightForLine) carry different precision: one 
is exact to the 1000th place, while the other has digits out to the 100,000th 
place. This results in comparisons like the following (example):
16.018005 > 16.018
which evaluates to "true". However, 16.018 > 16.018 would evaluate to "false".
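
The sensitivity described above can be reproduced outside PDFBox. The following 
is a minimal sketch, assuming a yGap that exceeds the threshold only in the 
sixth decimal place. The class name, the method names and the EPSILON value are 
hypothetical and are not part of PDFBox; a tolerance-based comparison is one 
possible alternative, not the project's fix.

```java
final class DropThresholdCompare {
    // Hypothetical tolerance chosen for illustration; not a PDFBox constant.
    static final float EPSILON = 1e-3f;

    // Strict comparison, same shape as the check quoted in the report.
    static boolean exceedsStrict(float yGap, float threshold) {
        return yGap > threshold;
    }

    // Tolerant comparison: yGap must exceed the threshold by more than EPSILON.
    static boolean exceedsWithTolerance(float yGap, float threshold) {
        return yGap - threshold > EPSILON;
    }

    public static void main(String[] args) {
        float yGap = 16.018005f;   // assumed example value
        float threshold = 16.018f; // assumed example value
        System.out.println(exceedsStrict(yGap, threshold));        // prints "true"
        System.out.println(exceedsWithTolerance(yGap, threshold)); // prints "false"
    }
}
```

With a tolerance, a difference of a few millionths no longer flips the result, 
while a genuinely larger gap still counts as a paragraph separation.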

EFFECT OF THE PROBLEM:
Every line in the output is marked as "isParagraphStart = true" and 
"writeParagraphEnd() ... = true".
For example:
|||NEW_LINE|||
|||PARAGRAPH_START|||PDFBox has been designed to represent PDF documents using 
familiar object-oriented paradigms. The data|||NEW_LINE|||
contained in a PDF document is a collection of basic object types: arrays, 
booleans, dictionaries, numbers,|||NEW_LINE|||
|||PARAGRAPH_END||||||NEW_LINE|||
|||PARAGRAPH_START|||strings and binary streams. PDFBox captures these basic 
object types in the org.pdfbox.cos package (the|||NEW_LINE|||
COS Model). While it's possible to create any desired interactions with a PDF 
document using only these|||NEW_LINE|||
|||PARAGRAPH_END||||||NEW_LINE|||

In the source PDF these lines appear as follows:
"PDFBox has been designed to represent PDF documents using familiar 
object-oriented paradigms. The data
contained in a PDF document is a collection of basic object types: arrays, 
booleans, dictionaries, numbers,
strings and binary streams. PDFBox captures these basic object types in the 
org.pdfbox.cos package (the
COS Model). While it's possible to create any desired interactions with a PDF 
document using only these"

MY WORKAROUND:
NOTE: there is a small performance hit with this workaround.

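         // Requires: import java.text.DecimalFormat;
         float yGap = Math.abs(position.getTextPosition().getYDirAdj()
                 - lastPosition.getTextPosition().getYDirAdj());

         // Round both sides of the comparison to two decimal places.
         // Note: "#.00" uses the default locale's decimal separator.
         DecimalFormat df = new DecimalFormat("#.00");
         float yGapTruncated = Float.valueOf(df.format(yGap));

         float newYVal = Float.valueOf(df.format(getDropThreshold()
                 * maxHeightForLine));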
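
The performance hit noted above comes from formatting each value to a string 
and parsing it back. A minimal sketch of an arithmetic alternative follows, 
assuming two decimal places are enough. The class and method names are 
hypothetical and not part of PDFBox, and Math.round rounds ties upward, which 
can differ from DecimalFormat's default rounding on exact ties.

```java
final class RoundTwoDecimals {
    // Rounds to two decimal places without any string formatting.
    static float roundTo2(float value) {
        return Math.round(value * 100f) / 100f;
    }

    // Hypothetical stand-in for the paragraph check: compares rounded values.
    static boolean exceedsRounded(float yGap, float threshold) {
        return roundTo2(yGap) > roundTo2(threshold);
    }

    public static void main(String[] args) {
        // Both assumed example values round to the same float, so no separation.
        System.out.println(exceedsRounded(16.018005f, 16.018f)); // prints "false"
        // A clearly larger gap still exceeds the threshold.
        System.out.println(exceedsRounded(20.0f, 16.018f));      // prints "true"
    }
}
```

This also avoids depending on the default locale's decimal separator.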




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
