Inconsistencies in TextPositionComparator and sortByPosition
------------------------------------------------------------

                 Key: PDFBOX-731
                 URL: https://issues.apache.org/jira/browse/PDFBOX-731
             Project: PDFBox
          Issue Type: Bug
          Components: Utilities
    Affects Versions: 1.1.0
         Environment: Any / all
            Reporter: Michael van Rooyen


Specifying sortByPosition on PDFTextStripper can scramble the extracted text.
The main cause is an inconsistency in TextPositionComparator: it does not
always satisfy the transitivity constraint required of a comparator (if a < b
and b < c, then a < c), so a true sort is sometimes not achievable. The
inconsistency arises because the comparator is too lenient about which text it
regards as being on the same "line".
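
To make the transitivity problem concrete, here is a minimal sketch that does
not use PDFBox at all. The Glyph class, the lenient() comparator and the
tolerance of 5 are hypothetical stand-ins chosen for illustration, not the
actual TextPositionComparator logic. Any comparator that treats "close enough"
y values as the same line can produce a cycle like this one:

```java
import java.util.Comparator;

public class ToleranceComparatorDemo {
    /** Hypothetical stand-in for a piece of text: just an x and a baseline y. */
    static final class Glyph {
        final float x;
        final float y;
        Glyph(float x, float y) { this.x = x; this.y = y; }
    }

    /**
     * Lenient comparator (an assumption for illustration): glyphs whose y
     * values differ by less than tol are treated as being on the same line
     * and ordered by x; otherwise they are ordered by y.
     */
    static Comparator<Glyph> lenient(final float tol) {
        return new Comparator<Glyph>() {
            public int compare(Glyph g1, Glyph g2) {
                if (Math.abs(g1.y - g2.y) >= tol) {
                    return Float.compare(g1.y, g2.y);
                }
                return Float.compare(g1.x, g2.x);
            }
        };
    }

    public static void main(String[] args) {
        Comparator<Glyph> cmp = lenient(5f);
        Glyph a = new Glyph(10f, 106f);
        Glyph b = new Glyph(20f, 103f);
        Glyph c = new Glyph(30f, 100f);

        // a and b are within the tolerance, and so are b and c, but a and c are not.
        System.out.println("a < b: " + (cmp.compare(a, b) < 0)); // true
        System.out.println("b < c: " + (cmp.compare(b, c) < 0)); // true
        System.out.println("a < c: " + (cmp.compare(a, c) < 0)); // false
    }
}
```

With a cycle like this, the result of a sort depends on the order in which the
elements happen to be compared, which is one way text can end up scrambled.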

I modified the comparator to be more strict when deciding which characters are 
on the same line, specifically:

1. Two pieces of text can't be on the same line if one's font is double or more 
the size of the other's.
2. Two pieces of text can't be on the same line if one's baseline is more than 
half the smaller font point size from the other's baseline.
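
The two rules above can be sketched as a single predicate. This is only an
illustration: the class and method names are hypothetical, and the boundary
cases (exactly double the size, exactly half the point size apart) are treated
as "not the same line", which is how the comparator source in this report
handles them.

```java
public class SameLineRule {
    /**
     * Returns true when two pieces of text may share a line under the two
     * rules. size1/size2 are font sizes and y1/y2 are baseline positions;
     * this is a hypothetical helper, not part of PDFBox.
     */
    static boolean sameLine(float size1, float y1, float size2, float y2) {
        // Rule 1: one font is double or more the size of the other.
        if (size1 <= size2 / 2 || size1 >= size2 * 2) {
            return false;
        }
        // Rule 2: baselines at least half the smaller font size apart.
        float smaller = Math.min(size1, size2);
        return Math.abs(y1 - y2) < smaller / 2;
    }

    public static void main(String[] args) {
        System.out.println(sameLine(12f, 100f, 12f, 103f)); // true
        System.out.println(sameLine(12f, 100f, 24f, 100f)); // false (rule 1)
        System.out.println(sameLine(12f, 100f, 12f, 106f)); // false (rule 2)
    }
}
```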

I'm sure there are cases (superscripts, perhaps?) where these two conditions
are too strict, but they should at least (I think, but haven't tried to
prove :) give a consistent ordering in which a < b and b < c imply a < c.  The
comparator source I have used is below; feel free to use or modify it in any
way.

Finally, PDFTextStripper needs to be more discriminating when inserting line
breaks.  Specifically, if the x position of a text segment is less than the x
position of the last text segment, then there is an implicit line break.  To
fix this, I changed:

     if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))

to:

     if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine)
             || (sortByPosition && positionX < lastPosition.getXDirAdj()))
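
As a rough illustration of what the extra condition does, the sketch below
evaluates the revised test on plain numbers. The overlap() method here is a
hypothetical stand-in (a simple vertical interval intersection), not the
actual PDFTextStripper implementation, and startsNewLine() is a name invented
for this example.

```java
public class LineBreakSketch {
    /**
     * Assumed stand-in for the stripper's overlap check: true when the
     * vertical extents [y - height, y] of the two items intersect.
     */
    static boolean overlap(float y1, float height1, float y2, float height2) {
        return y1 - height1 <= y2 && y2 - height2 <= y1;
    }

    /** The revised condition from this report, applied to plain values. */
    static boolean startsNewLine(boolean sortByPosition,
                                 float positionX, float positionY, float positionHeight,
                                 float lastX, float maxYForLine, float maxHeightForLine) {
        return !overlap(positionY, positionHeight, maxYForLine, maxHeightForLine)
                || (sortByPosition && positionX < lastX);
    }

    public static void main(String[] args) {
        // Same vertical band, but the new segment starts to the left of the last one.
        System.out.println(startsNewLine(true, 50f, 100f, 10f, 200f, 100f, 10f));  // true
        // Same inputs without sortByPosition: only the overlap test applies.
        System.out.println(startsNewLine(false, 50f, 100f, 10f, 200f, 100f, 10f)); // false
        // Same vertical band, continuing to the right: no line break.
        System.out.println(startsNewLine(true, 250f, 100f, 10f, 200f, 100f, 10f)); // false
    }
}
```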

Revised comparator source:

import java.util.Comparator;

public class TextPositionComparator implements Comparator
{
    /**
     * Strict ordering: by direction-adjusted Y first, then by X.
     * Used when the two pieces of text cannot be on the same line.
     */
    private int strictCompare(Object o1, Object o2)
    {
        TextPosition pos1 = (TextPosition)o1;
        TextPosition pos2 = (TextPosition)o2;

        // Get the text direction adjusted coordinates
        float pos1YBottom = pos1.getYDirAdj();
        float pos2YBottom = pos2.getYDirAdj();

        if (pos1YBottom < pos2YBottom)
            return -1;
        else if (pos1YBottom > pos2YBottom)
            return 1;

        float x1 = pos1.getXDirAdj();
        float x2 = pos2.getXDirAdj();

        if (x1 < x2)
            return -1;
        else if (x1 > x2)
            return 1;

        return 0;
    }

    public int compare(Object o1, Object o2)
    {
        TextPosition pos1 = (TextPosition)o1;
        TextPosition pos2 = (TextPosition)o2;

        /* Order by text direction first; only text in the same direction
           is compared by position. */
        if (pos1.getDir() < pos2.getDir())
            return -1;
        else if (pos1.getDir() > pos2.getDir())
            return 1;

        float size1 = pos1.getFontSize();
        float size2 = pos2.getFontSize();

        // Rule 1: not on the same line if one font is double or more the
        // size of the other.
        if (size1 <= size2/2 || size1 >= size2*2)
            return strictCompare(o1, o2);

        // Use the smaller of the two font sizes for the baseline test
        float fontsize = size1;

        if (size2 < size1)
            fontsize = size2;

        float pos1YBottom = pos1.getYDirAdj();
        float pos2YBottom = pos2.getYDirAdj();

        // Rule 2: not on the same line if the baselines are half the
        // smaller font size or more apart.
        if (pos1YBottom <= pos2YBottom - fontsize/2
                || pos1YBottom >= pos2YBottom + fontsize/2)
            return strictCompare(o1, o2);

        // Same line: order by the text direction adjusted X coordinate
        float x1 = pos1.getXDirAdj();
        float x2 = pos2.getXDirAdj();

        if (x1 < x2)
            return -1;
        else if (x1 > x2)
            return 1;

        return 0;
    }
}

YMMV.



