[jira] [Created] (PDFBOX-1222) PDFs created with idealsoftware.com's VPE are all wrong

Radek (Created) (JIRA) Sun, 05 Feb 2012 18:40:30 -0800

PDFs created with idealsoftware.com's VPE are all wrong
-------------------------------------------------------


                 Key: PDFBOX-1222
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1222
             Project: PDFBox
          Issue Type: Bug
    Affects Versions: 1.6.0
            Reporter: Radek


Steps to reproduce:

1. Download the example PDF I will attach. It is the same as the "example rich 
text format" PDF from idealsoftware.com, but with text extraction protection 
disabled.

2a. java -jar pdfbox-app-1.6.0.jar ExtractText -sort rtf.pdf extr.txt

Actual results:
The extracted text is gibberish. On close inspection, the sort "reads" the 
text vertically: the output contains the first character of every line, then 
the second character of every line, and so on.
Moreover, JDK 7 throws java.lang.IllegalArgumentException: Comparison method 
violates its general contract! (the comparator in question is the text 
position sorting comparator)
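
For context, JDK 7's sort can throw this exception when a comparator is not 
transitive. The sketch below is a hypothetical illustration, not PDFBox's 
actual comparator: it assumes a comparator that treats two coordinates as 
equal when they lie within a tolerance, which is one common way to break the 
contract. The class name, the TOLERANCE value and the sample coordinates are 
all made up for the example.

```java
import java.util.Comparator;

// Hypothetical illustration only; this is NOT the PDFBox comparator.
final class ToleranceComparatorDemo {

    // Assumed tolerance: coordinates closer than this count as "the same line".
    static final float TOLERANCE = 1.0f;

    // Orders coordinates numerically, except that values within TOLERANCE
    // of each other compare as equal.
    static final Comparator<Float> BY_LINE = new Comparator<Float>() {
        @Override
        public int compare(Float a, Float b) {
            if (Math.abs(a - b) < TOLERANCE) {
                return 0;
            }
            return Float.compare(a, b);
        }
    };

    // Returns true when a equals b and b equals c under cmp, yet a does not
    // equal c. That combination violates the Comparator contract.
    static boolean breaksTransitivity(Comparator<Float> cmp,
                                      float a, float b, float c) {
        return cmp.compare(a, b) == 0
            && cmp.compare(b, c) == 0
            && cmp.compare(a, c) != 0;
    }

    public static void main(String[] args) {
        // 0.0 ~ 0.6 and 0.6 ~ 1.2, but 0.0 and 1.2 are a full 1.2 apart.
        System.out.println(breaksTransitivity(BY_LINE, 0.0f, 0.6f, 1.2f));
    }
}
```

Whether the text position comparator fails in this particular way has not 
been confirmed; the sketch only shows the kind of inconsistency that the 
exception reports.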

Poking around the code indicates that the sort order would be correct *if* 
the character rotation were 270 degrees. Instead, the rotation is 
(correctly?) calculated as zero.
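
The vertical reading order can be reproduced with a toy model. This is a 
minimal sketch under stated assumptions, not the PDFBox code: Glyph, read() 
and twoLines() are hypothetical names, y is assumed to grow downward, and 
"swapAxes" stands in for whatever makes the sort compare the wrong 
coordinate first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical toy model of position-based sorting; not PDFBox code.
final class SortAxisDemo {

    static final class Glyph {
        final char c;
        final float x;
        final float y;

        Glyph(char c, float x, float y) {
            this.c = c;
            this.x = x;
            this.y = y;
        }
    }

    // Sorts a copy of the glyphs and concatenates their characters.
    // swapAxes == false: compare y first, then x (line by line).
    // swapAxes == true:  compare x first, then y (column by column).
    static String read(List<Glyph> glyphs, final boolean swapAxes) {
        List<Glyph> sorted = new ArrayList<Glyph>(glyphs);
        Collections.sort(sorted, new Comparator<Glyph>() {
            @Override
            public int compare(Glyph a, Glyph b) {
                float primaryA = swapAxes ? a.x : a.y;
                float primaryB = swapAxes ? b.x : b.y;
                float secondaryA = swapAxes ? a.y : a.x;
                float secondaryB = swapAxes ? b.y : b.x;
                int byPrimary = Float.compare(primaryA, primaryB);
                return byPrimary != 0
                    ? byPrimary
                    : Float.compare(secondaryA, secondaryB);
            }
        });
        StringBuilder sb = new StringBuilder();
        for (Glyph g : sorted) {
            sb.append(g.c);
        }
        return sb.toString();
    }

    // Two lines of text, "AB" above "CD".
    static List<Glyph> twoLines() {
        List<Glyph> glyphs = new ArrayList<Glyph>();
        glyphs.add(new Glyph('A', 0f, 0f));
        glyphs.add(new Glyph('B', 1f, 0f));
        glyphs.add(new Glyph('C', 0f, 1f));
        glyphs.add(new Glyph('D', 1f, 1f));
        return glyphs;
    }

    public static void main(String[] args) {
        System.out.println(read(twoLines(), false)); // ABCD
        System.out.println(read(twoLines(), true));  // ACBD
    }
}
```

With the axes swapped, the output lists the first character of each line, 
then the second character of each line, which matches the gibberish 
described above.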

2b. java -jar pdfbox-app-1.6.0.jar ExtractText rtf.pdf extr.txt

Actual results:
The text is fine, but each page is glued into a single line. Poking around 
the code indicates that character offsets move down the page correctly, but 
the expected line height is huge (full page height or width?), so the 
offsets never move far enough to trigger newline detection.
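
The effect of an oversized line height can be sketched as follows. This is a 
hypothetical model, not the PDFBox implementation: the join() helper and the 
"more than half the expected line height" rule are assumptions chosen only 
to show how a page-sized line height suppresses every line break.

```java
// Hypothetical model of newline detection; not PDFBox code.
final class NewlineThresholdDemo {

    // Emits a newline before a character whose y coordinate differs from
    // the previous character's by more than half the expected line height.
    // The "half a line" rule is an assumption made for this sketch.
    static String join(char[] chars, float[] ys, float expectedLineHeight) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chars.length; i++) {
            if (i > 0
                    && Math.abs(ys[i] - ys[i - 1]) > expectedLineHeight / 2) {
                sb.append('\n');
            }
            sb.append(chars[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char[] chars = "ABCD".toCharArray();
        float[] ys = {700f, 700f, 688f, 688f}; // two lines, 12 units apart

        // Plausible line height: a break is detected between B and C.
        System.out.println(join(chars, ys, 12f));

        // Page-sized line height: the 12-unit step never crosses the threshold.
        System.out.println(join(chars, ys, 792f));
    }
}
```

In this model the second call prints everything on one line, which is the 
same symptom as the one-line-per-page output.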

So something is very wrong with the character positions in those files, and 
as a result PDFBox does not extract the text correctly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
