Inconsistencies in TextPositionComparator and sortByPosition
------------------------------------------------------------
Key: PDFBOX-731
URL: https://issues.apache.org/jira/browse/PDFBOX-731
Project: PDFBox
Issue Type: Bug
Components: Utilities
Affects Versions: 1.1.0
Environment: Any / all
Reporter: Michael van Rooyen
Specifying sortByPosition on PDFTextStripper can result in scrambling of text.
The problem is caused largely by inconsistencies in TextPositionComparator,
which does not always satisfy the required comparator constraint that if a < b
and b < c, then a < c. As a result, a true sort is sometimes not achievable.
This is caused by the comparator being too flexible with what is regarded as
being on the same "line".
I modified the comparator to be more strict when deciding which characters are
on the same line, specifically:
1. Two pieces of text can't be on the same line if one's font is double or more
the size of the other's.
2. Two pieces of text can't be on the same line if one's baseline is more than
half the smaller font point size from the other's baseline.
I'm sure there are probably (superscript?) cases where these two conditions may
be too strict, but at least they should (I think but haven't tried to prove :)
result in a < b < c. The comparator source I have used is below, feel free to
use or modify it in any way.
Finally, PDFTextStripper needs to be more discriminating in inserting line
breaks. Specifically, if the x position of a text segment is < the x position
of the last text segment, the there is an implicit line-break. To fix this, I
changed:
if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))
to:
if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine) ||
(sortByPosition && positionX < lastPosition.getXDirAdj()))
Revised comparator source:
public class TextPositionComparator implements Comparator
{
private int strictCompare(Object o1, Object o2)
{
TextPosition pos1 = (TextPosition)o1;
TextPosition pos2 = (TextPosition)o2;
// Get the text direction adjusted coordinates
float pos1YBottom = pos1.getYDirAdj();
float pos2YBottom = pos2.getYDirAdj();
if (pos1YBottom < pos2YBottom)
return -1;
else if (pos1YBottom > pos2YBottom)
return 1;
float x1 = pos1.getXDirAdj();
float x2 = pos2.getXDirAdj();
if (x1 < x2)
return -1;
else if (x1 > x2)
return 1;
return 0;
}
public int compare(Object o1, Object o2)
{
TextPosition pos1 = (TextPosition)o1;
TextPosition pos2 = (TextPosition)o2;
/* Only compare text that is in the same direction. */
if (pos1.getDir() < pos2.getDir())
return -1;
else if (pos1.getDir() > pos2.getDir())
return 1;
float size1 = pos1.getFontSize();
float size2 = pos2.getFontSize();
if (size1 <= size2/2 || size1 >= size2*2)
return strictCompare(o1, o2);
float fontsize = size1;
if (size2 < size1)
fontsize = size2;
float pos1YBottom = pos1.getYDirAdj();
float pos2YBottom = pos2.getYDirAdj();
if (pos1YBottom <= pos2YBottom - fontsize/2 || pos1YBottom >=
pos2YBottom + fontsize/2)
return strictCompare(o1, o2);
// Get the text direction adjusted coordinates
float x1 = pos1.getXDirAdj();
float x2 = pos2.getXDirAdj();
if (x1 < x2)
return -1;
else if (x1 > x2)
return 1;
return 0;
}
}
YMMV.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.