[ https://issues.apache.org/jira/browse/PDFBOX-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760203#comment-13760203 ]
SCHAEFER B.S. commented on PDFBOX-1512:
---------------------------------------
As Andreas Lehmkühler pointed out, the problem lies in the overlap check in
the comparator, but changing that check breaks existing implementations.
We wrapped the sort in a try/catch, so that a contract violation falls back
to a safe-mode comparator instead of aborting the sort
<code>
try {
    Collections.sort(textList,
            WFI_PDFParser_TextPositionComparator.getInstance());
} catch (IllegalArgumentException ex) {
    // The sort reports a broken comparator contract as an
    // IllegalArgumentException -> sort again in safe mode
    Collections.sort(textList,
            WFI_PDFParser_TextPositionComparator.getSafeInstance());
}
</code>
and modified the compare method like this:
<code>
@Override
public int compare(Object o1, Object o2) {
    int result;
    TextPosition pos1 = (TextPosition) o1;
    TextPosition pos2 = (TextPosition) o2;

    /* Only compare text that is in the same direction. */
    if (pos1.getDir() < pos2.getDir()) {
        result = -1;
    } else if (pos1.getDir() > pos2.getDir()) {
        result = 1;
    } else {
        // Get the text direction adjusted coordinates
        float x1 = pos1.getXDirAdj();
        float x2 = pos2.getXDirAdj();
        float pos1YBottom = pos1.getYDirAdj();
        float pos2YBottom = pos2.getYDirAdj();

        // note that the coordinates have been adjusted so 0,0 is in upper left
        float pos1YTop = pos1YBottom - pos1.getHeightDir();
        float pos2YTop = pos2YBottom - pos2.getHeightDir();

        float ydiff = Math.abs(pos1YBottom - pos2YBottom);
        boolean issmallydiff = ydiff < .1;

        if (_safemode) {
            // Do not check for overlaps here
            if (issmallydiff) {
                result = compareX(x1, x2);
            } else {
                if (pos1YBottom > pos2YBottom) {
                    result = 1;
                } else if (pos1YBottom < pos2YBottom) {
                    result = -1;
                } else {
                    result = compareX(x1, x2);
                }
            }
        } else {
            boolean ispos1overlap = (pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom);
            boolean ispos2overlap = (pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom);
            if (issmallydiff || ispos1overlap || ispos2overlap) {
                result = compareX(x1, x2);
            } else {
                if (pos1YBottom > pos2YBottom) {
                    result = 1;
                } else if (pos1YBottom < pos2YBottom) {
                    result = -1;
                } else {
                    result = compareX(x1, x2);
                }
            }
        }
    }
    return result;
}

private int compareX(float x1, float x2) {
    if (x1 < x2) {
        return -1;
    } else if (x1 > x2) {
        return 1;
    } else {
        return 0;
    }
}
</code>
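For anyone who wants to try the fallback outside of our code base, here is a minimal, self-contained sketch of the same pattern. The class name FallbackSort and the method sortWithFallback are made up for this example and are not part of PDFBox. As an extra precaution that our snippet above does not take, the sketch sorts a copy and only writes the result back once a sort has finished, on the assumption that a sort aborted halfway may leave the list in an unspecified order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.ListIterator;

public class FallbackSort {

    /**
     * Sorts the list with the primary comparator. If the sort reports a
     * contract violation (IllegalArgumentException), it sorts again with
     * the safe comparator. Hypothetical helper, not a PDFBox API.
     */
    public static <T> void sortWithFallback(List<T> list,
                                            Comparator<? super T> primary,
                                            Comparator<? super T> safe) {
        // Work on a copy so a failed attempt cannot disturb the original list.
        List<T> copy = new ArrayList<T>(list);
        try {
            Collections.sort(copy, primary);
        } catch (IllegalArgumentException ex) {
            // Start again from the untouched input and use the safe comparator.
            copy = new ArrayList<T>(list);
            Collections.sort(copy, safe);
        }
        // Write the sorted elements back into the caller's list.
        ListIterator<T> it = list.listIterator();
        for (T item : copy) {
            it.next();
            it.set(item);
        }
    }
}
```

Note that the sort does not detect every contract violation, so the fallback only helps in the cases where the exception is actually thrown.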
Maybe this helps ...
> TextPositionComparator is not compatible with Java 7
> ----------------------------------------------------
>
> Key: PDFBOX-1512
> URL: https://issues.apache.org/jira/browse/PDFBOX-1512
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.7.1
> Environment: Java 7
> Reporter: Benjamin Papez
> Assignee: Andreas Lehmkühler
> Attachments: immo-kurier_arsenal_93x62.pdf,
> TextPositionComparator.java
>
>
> The TextPositionComparator causes the following exception when running on Java 7:
> Unexpected RuntimeException from
> org.apache.tika.parser.ParserDecorator$1@9007fa2 Original cause: Comparison
> method violates its general contract!
> I think the problem is with this check:
> if ( yDifference < .1 ||
> (pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom) ||
> (pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom))
> as it violates the contract requirement:
> The implementor must also ensure that the relation is transitive:
> ((compare(x, y)>0) && (compare(y, z)>0)) implies compare(x, z)>0.
> Finally, the implementor must ensure that compare(x, y)==0 implies that
> sgn(compare(x, z))==sgn(compare(y, z)) for all z.
> Java 7 is now strict and throws an exception when it detects that the
> contract is violated.
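The transitivity requirement quoted above can be checked against the overlap test with three boxes. The sketch below is only an illustration: Box is a made-up stand-in for TextPosition, the coordinates are invented, and compareWithOverlap copies just the decision structure of the quoted check. A overlaps B vertically and B overlaps C, so both pairs are ordered by x, but A and C do not overlap and are ordered by their bottom y.

```java
public class OverlapTransitivityDemo {

    /** Hypothetical stand-in for TextPosition; 0,0 is the upper left corner. */
    public static final class Box {
        final float x;
        final float yTop;
        final float yBottom;

        public Box(float x, float yTop, float yBottom) {
            this.x = x;
            this.yTop = yTop;
            this.yBottom = yBottom;
        }
    }

    // A overlaps B, B overlaps C, but A and C do not overlap.
    public static final Box A = new Box(30f, 0f, 10f);
    public static final Box B = new Box(20f, 9f, 15f);
    public static final Box C = new Box(10f, 14f, 22f);

    /** Same decision structure as the check quoted in the issue. */
    public static int compareWithOverlap(Box p1, Box p2) {
        boolean smallYDiff = Math.abs(p1.yBottom - p2.yBottom) < .1;
        boolean p1Overlap = p1.yBottom >= p2.yTop && p1.yBottom <= p2.yBottom;
        boolean p2Overlap = p2.yBottom >= p1.yTop && p2.yBottom <= p1.yBottom;
        if (smallYDiff || p1Overlap || p2Overlap) {
            // Treated as the same line: order by x
            return Float.compare(p1.x, p2.x);
        }
        // Different lines: order by bottom y
        return Float.compare(p1.yBottom, p2.yBottom);
    }

    public static void main(String[] args) {
        System.out.println("compare(A, B) = " + compareWithOverlap(A, B));
        System.out.println("compare(B, C) = " + compareWithOverlap(B, C));
        System.out.println("compare(A, C) = " + compareWithOverlap(A, C));
    }
}
```

With these values compare(A, B) and compare(B, C) are both positive while compare(A, C) is negative, which is exactly the case the transitivity rule forbids.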