[ https://issues.apache.org/jira/browse/PDFBOX-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler reassigned PDFBOX-731: ----------------------------------------- Assignee: Andreas Lehmkühler > Inconsistencies in TextPositionComparator and sortByPosition > ------------------------------------------------------------ > > Key: PDFBOX-731 > URL: https://issues.apache.org/jira/browse/PDFBOX-731 > Project: PDFBox > Issue Type: Bug > Components: Utilities > Affects Versions: 1.1.0, 2.0.0 > Environment: Any / all > Reporter: Michael van Rooyen > Assignee: Andreas Lehmkühler > Fix For: 2.0.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Specifying sortByPosition on PDFTextStripper can result in scrambling of > text. The problem is caused largely by inconsistencies in > TextPositionComparator, which does not always satisfy the required comparator > constraint that if a < b and b < c, then a < c. As a result, a true sort is > sometimes not achievable. This is caused by the comparator being too > flexible with what is regarded as being on the same "line". > I modified the comparator to be more strict when deciding which characters > are on the same line, specifically: > 1. Two pieces of text can't be on the same line if one's font is double or > more the size of the other's. > 2. Two pieces of text can't be on the same line if one's baseline is more > than half the smaller font point size from the other's baseline. > I'm sure there are probably (superscript?) cases where these two conditions > may be too strict, but at least they should (I think but haven't tried to > prove :) result in a < b < c. The comparator source I have used is below, > feel free to use or modify it in any way. > Finally, PDFTextStripper needs to be more discriminating in inserting line > breaks. Specifically, if the x position of a text segment is < the x > position of the last text segment, the there is an implicit line-break. To > fix this, I changed: > {code} > if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine)) > {code} > to: > {code} > if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine) || > (sortByPosition && positionX < lastPosition.getXDirAdj())) > {code} > Revised comparator source: > {code} > public class TextPositionComparator implements Comparator > { > private int strictCompare(Object o1, Object o2) > { > TextPosition pos1 = (TextPosition)o1; > TextPosition pos2 = (TextPosition)o2; > > // Get the text direction adjusted coordinates > > float pos1YBottom = pos1.getYDirAdj(); > float pos2YBottom = pos2.getYDirAdj(); > if (pos1YBottom < pos2YBottom) > return -1; > else if (pos1YBottom > pos2YBottom) > return 1; > > float x1 = pos1.getXDirAdj(); > float x2 = pos2.getXDirAdj(); > > if (x1 < x2) > return -1; > else if (x1 > x2) > return 1; > > return 0; > } > > public int compare(Object o1, Object o2) > { > TextPosition pos1 = (TextPosition)o1; > TextPosition pos2 = (TextPosition)o2; > /* Only compare text that is in the same direction. */ > if (pos1.getDir() < pos2.getDir()) > return -1; > else if (pos1.getDir() > pos2.getDir()) > return 1; > float size1 = pos1.getFontSize(); > float size2 = pos2.getFontSize(); > > if (size1 <= size2/2 || size1 >= size2*2) > return strictCompare(o1, o2); > float fontsize = size1; > > if (size2 < size1) > fontsize = size2; > > float pos1YBottom = pos1.getYDirAdj(); > float pos2YBottom = pos2.getYDirAdj(); > if (pos1YBottom <= pos2YBottom - fontsize/2 || pos1YBottom >= > pos2YBottom + fontsize/2) > return strictCompare(o1, o2); > > // Get the text direction adjusted coordinates > float x1 = pos1.getXDirAdj(); > float x2 = pos2.getXDirAdj(); > if (x1 < x2) > return -1; > else if (x1 > x2) > return 1; > > return 0; > } > } > {code} > YMMV. -- This message was sent by Atlassian JIRA (v6.3.4#6332)