[ 
https://issues.apache.org/jira/browse/PDFBOX-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler closed PDFBOX-1222.
--------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.7.0
         Assignee: Andreas Lehmkühler

Text extraction works correctly as of PDFBox 1.7.0. The "Comparison method 
violates its general contract" exception no longer occurs as of 1.7.0 either.


> PDFs created with idealsoftware.com's VPE are all wrong
> -------------------------------------------------------
>
>                 Key: PDFBOX-1222
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1222
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0
>            Reporter: Radek
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.7.0
>
>         Attachments: rtf.pdf
>
>
> Follow the steps:
> 1. Download the example pdf I'll attach. It's the same as "example rich text 
> format" pdf from idealsoftware.com but with text extraction protection 
> disabled.
> 2a. java -jar pdfbox-app-1.6.0.jar ExtractText -sort rtf.pdf extr.txt
> Actual results:
> Text is all gibberish. If you look at it very carefully, sorting "reads" the 
> text vertically and you find first characters of each line first, then second 
> characters of each line, etc.
> Moreover, on jdk7: java.lang.IllegalArgumentException: Comparison method 
> violates its general contract! (that's the text position sorting comparator)
> Poking around the code indicates that the sorting would be correct *if* the 
> character rotation were 270 degrees. PDFBox (correctly?) calculates it as 
> zero instead.
> 2b. java -jar pdfbox-app-1.6.0.jar ExtractText rtf.pdf extr.txt
> Actual results:
> Text is fine, but each page is run together on a single line. Poking around 
> the code indicates that the character offsets move down the page correctly, 
> but the expected line height is huge (full page height or width?), so the 
> offsets never change enough to trigger newline detection.
> So, there's something very wrong with character positions in those files, 
> making pdfbox not extract text correctly.
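
For readers unfamiliar with the jdk7 exception quoted above: the sort used by
Collections.sort since JDK 7 may throw "Comparison method violates its general
contract!" when it detects an inconsistent comparator. The sketch below is not
PDFBox code. It assumes a hypothetical Pos class and a hypothetical comparator
(FUZZY) that treats glyphs within a vertical tolerance as being on the same
line, and shows how such a tolerance can break transitivity. STRICT is an
equally hypothetical comparator included for contrast.

```java
import java.util.Comparator;

public class ToleranceComparatorDemo {
    // Hypothetical glyph position, not a PDFBox class.
    static final class Pos {
        final float x, y;
        Pos(float x, float y) { this.x = x; this.y = y; }
    }

    // Assumed tolerance: glyphs closer than this in y count as "same line".
    static final float TOLERANCE = 1.0f;

    // Same line (within tolerance): order by x. Otherwise: order by y.
    static final Comparator<Pos> FUZZY = (a, b) -> {
        if (Math.abs(a.y - b.y) < TOLERANCE) {
            return Float.compare(a.x, b.x);
        }
        return Float.compare(a.y, b.y);
    };

    // Plain lexicographic order on (y, x); this one is transitive.
    static final Comparator<Pos> STRICT = (a, b) -> {
        int byY = Float.compare(a.y, b.y);
        return byY != 0 ? byY : Float.compare(a.x, b.x);
    };

    // True if a < b and b < c, but a is not < c under cmp.
    static boolean violatesTransitivity(Comparator<Pos> cmp, Pos a, Pos b, Pos c) {
        return cmp.compare(a, b) < 0
            && cmp.compare(b, c) < 0
            && cmp.compare(a, c) >= 0;
    }

    public static void main(String[] args) {
        // a and b are "same line", b and c are "same line", a and c are not.
        Pos a = new Pos(0f, 1.2f);
        Pos b = new Pos(1f, 0.6f);
        Pos c = new Pos(2f, 0.0f);
        System.out.println("fuzzy violates:  " + violatesTransitivity(FUZZY, a, b, c));
        System.out.println("strict violates: " + violatesTransitivity(STRICT, a, b, c));
    }
}
```

Whether the sort actually throws depends on the input order and size, so such
a comparator can appear to work on some documents and fail on others.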



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
