[jira] [Updated] (PDFBOX-1222) PDFs created with idealsoftware.com's VPE are all wrong

Radek (Updated) (JIRA) Sun, 05 Feb 2012 18:42:33 -0800

     [ https://issues.apache.org/jira/browse/PDFBOX-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Radek updated PDFBOX-1222:
--------------------------

    Attachment: rtf.pdf

example file
                
> PDFs created with idealsoftware.com's VPE are all wrong
> -------------------------------------------------------
>
>                 Key: PDFBOX-1222
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1222
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.6.0
>            Reporter: Radek
>         Attachments: rtf.pdf
>
>
> Steps to reproduce:
> 1. Download the attached example PDF (rtf.pdf). It is the same as the 
> "example rich text format" PDF from idealsoftware.com, but with text 
> extraction protection disabled.
> 2a. java -jar pdfbox-app-1.6.0.jar ExtractText -sort rtf.pdf extr.txt
> Actual results:
> Text is all gibberish. On close inspection, sorting "reads" the text 
> vertically: the first characters of every line come first, then the second 
> characters of every line, and so on.
> Moreover, on JDK 7 it throws 
> java.lang.IllegalArgumentException: Comparison method violates its general contract! 
> (that is the text position sorting comparator).
> Poking around the code indicates that the sorting would be correct *if* the 
> character rotation were 270 degrees. It is (correctly?) calculated as zero 
> instead.
> 2b. java -jar pdfbox-app-1.6.0.jar ExtractText rtf.pdf extr.txt
> Actual results:
> Text is fine, but each page is glued into a single line. Poking around the 
> code indicates that the character offsets go down correctly, but the expected 
> line height is huge (full page height or width?), so the offsets never drop 
> far enough to trigger newline detection.
> So there is something very wrong with the character positions in those files, 
> which prevents PDFBox from extracting the text correctly.
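
The JDK 7 exception in step 2a is what the sort introduced in Java 7 (TimSort) throws when it notices that a comparator is inconsistent. The following is only a minimal sketch of how a position comparator can become inconsistent, assuming it treats glyphs whose Y values are within some tolerance as being on the same line. The class name, the tolerance value and the {x, y} array layout are hypothetical and are not taken from the PDFBox source.

```java
import java.util.Comparator;

public class ToleranceComparatorSketch {
    // Hypothetical tolerance: Y values closer than this count as "the same line".
    static final double LINE_TOLERANCE = 1.0;

    // Illustrative comparator over {x, y} pairs; NOT PDFBox's actual comparator.
    static final class ByLineThenX implements Comparator<double[]> {
        @Override
        public int compare(double[] p, double[] q) {
            if (Math.abs(p[1] - q[1]) < LINE_TOLERANCE) {
                // Considered the same line: order by X.
                return Double.compare(p[0], q[0]);
            }
            // Considered different lines: order by Y.
            return Double.compare(p[1], q[1]);
        }
    }

    // True when a < b and b < c, but a is NOT < c, i.e. transitivity is broken.
    static boolean violatesTransitivity(double[] a, double[] b, double[] c) {
        Comparator<double[]> cmp = new ByLineThenX();
        return cmp.compare(a, b) < 0
            && cmp.compare(b, c) < 0
            && cmp.compare(a, c) >= 0;
    }

    public static void main(String[] args) {
        // Each neighbouring pair is within the tolerance, the outer pair is not.
        double[] a = {0.0, 1.6};
        double[] b = {1.0, 0.8};
        double[] c = {2.0, 0.0};
        System.out.println(violatesTransitivity(a, b, c)); // prints "true"
    }
}
```

With inputs like these, no total order exists, so a sort that checks for consistency may reject the comparator. Whether this is the actual cause in PDFBOX-1222 is not established by the report.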

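The behaviour described in step 2b can be pictured with a small sketch, assuming a newline is detected only when the Y offset between consecutive characters changes by more than the expected line height. The rule, the method names and the numbers below are assumptions for illustration and do not reproduce the PDFBox implementation.

```java
public class LineBreakSketch {
    // Assumed rule: a new line starts when Y moves by more than the expected line height.
    static boolean startsNewLine(double previousY, double currentY, double expectedLineHeight) {
        return Math.abs(currentY - previousY) > expectedLineHeight;
    }

    // Counts how many lines the assumed rule detects in a sequence of character Y offsets.
    static int countLines(double[] ys, double expectedLineHeight) {
        if (ys.length == 0) {
            return 0;
        }
        int lines = 1;
        for (int i = 1; i < ys.length; i++) {
            if (startsNewLine(ys[i - 1], ys[i], expectedLineHeight)) {
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical offsets: three lines of text, 14 units apart.
        double[] ys = {700, 700, 686, 686, 672};
        System.out.println(countLines(ys, 10.0));  // prints 3: a plausible line height
        System.out.println(countLines(ys, 842.0)); // prints 1: a page-sized "line height"
    }
}
```

With a page-sized expected height, no drop between characters is ever large enough, which matches the "each page on a single line" symptom.
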
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
