[ https://issues.apache.org/jira/browse/PDFBOX-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler closed PDFBOX-1222.
--------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.7.0
         Assignee: Andreas Lehmkühler

Text extraction works correctly as of PDFBox 1.7.0. The "Comparison method violates its general contract" exception no longer occurs as of 1.7.0 either.

> PDFs created with idealsoftware.com's VPE are all wrong
> -------------------------------------------------------
>
>                 Key: PDFBOX-1222
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1222
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0
>            Reporter: Radek
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.7.0
>
>         Attachments: rtf.pdf
>
>
> Follow the steps:
> 1. Download the example PDF I'll attach. It's the same as the "example rich
> text format" PDF from idealsoftware.com but with text extraction protection
> disabled.
> 2a. java -jar pdfbox-app-1.6.0.jar ExtractText -sort rtf.pdf extr.txt
> Actual results:
> Text is all gibberish. If you look at it very carefully, sorting "reads" the
> text vertically: you find the first characters of each line first, then the
> second characters of each line, etc.
> Moreover, on JDK 7: java.lang.IllegalArgumentException: Comparison method
> violates its general contract! (that's the text position sorting comparator)
> Poking around the code indicates that sorting would be correct *if* the
> character rotation were 270 degrees. It (correctly?) calculates it as zero
> instead.
> 2b. java -jar pdfbox-app-1.6.0.jar ExtractText rtf.pdf extr.txt
> Actual results:
> Text is fine, but each page is glued into a single line. Poking around the
> code indicates that character offsets go down correctly, but the expected
> line height is huge (full page height or width?) and therefore they never go
> down sufficiently to trigger newline detection.
> So, there's something very wrong with character positions in those files,
> making PDFBox not extract text correctly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
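
For readers wondering how a position comparator can trigger "Comparison method violates its general contract!" on JDK 7, here is a minimal sketch. It is not PDFBox's code: the Glyph class, the LINE_TOLERANCE value and both comparators are hypothetical, chosen only to show one common way such a comparator becomes non-transitive. The sort introduced in JDK 7 can detect this kind of inconsistency and throw, whereas older JDKs silently returned some order.

```java
import java.util.Comparator;

class ComparatorContractDemo {

    // Hypothetical stand-in for a positioned character; not a PDFBox class.
    static final class Glyph {
        final float x;
        final float y;

        Glyph(float x, float y) {
            this.x = x;
            this.y = y;
        }
    }

    // Assumed tolerance: glyphs whose y values differ by less than this
    // are treated as being on the same line.
    static final float LINE_TOLERANCE = 5f;

    // Fuzzy ordering: "same line" glyphs are ordered by x, others by y.
    // "Same line" is not an equivalence relation here, so the ordering
    // is not transitive.
    static final Comparator<Glyph> FUZZY = (a, b) -> {
        if (Math.abs(a.y - b.y) < LINE_TOLERANCE) {
            return Float.compare(a.x, b.x);
        }
        return Float.compare(a.y, b.y);
    };

    // A consistent total order for contrast: y first, then x, no tolerance.
    static final Comparator<Glyph> STRICT =
            Comparator.<Glyph>comparingDouble(g -> g.y)
                      .thenComparingDouble(g -> g.x);

    // True when a < b and b < c hold but a < c does not.
    static boolean breaksTransitivity(Comparator<Glyph> cmp,
                                      Glyph a, Glyph b, Glyph c) {
        return cmp.compare(a, b) < 0
                && cmp.compare(b, c) < 0
                && cmp.compare(a, c) >= 0;
    }

    public static void main(String[] args) {
        // Each neighbouring pair is within tolerance, but a and c are not.
        Glyph a = new Glyph(0f, 6f);
        Glyph b = new Glyph(1f, 3f);
        Glyph c = new Glyph(2f, 0f);

        System.out.println("fuzzy breaks transitivity:  "
                + breaksTransitivity(FUZZY, a, b, c));
        System.out.println("strict breaks transitivity: "
                + breaksTransitivity(STRICT, a, b, c));
    }
}
```

Whether the comparator in 1.6.0 failed in exactly this way is not established by this report. The sketch only shows why positions that make many glyphs look like they share a line can push a tolerance-based comparator into inconsistent answers.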