[
https://issues.apache.org/jira/browse/PDFBOX-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Radek updated PDFBOX-1222:
--------------------------
Attachment: rtf.pdf
example file
> PDFs created with idealsoftware.com's VPE are all wrong
> -------------------------------------------------------
>
> Key: PDFBOX-1222
> URL: https://issues.apache.org/jira/browse/PDFBOX-1222
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 1.6.0
> Reporter: Radek
> Attachments: rtf.pdf
>
>
> Steps to reproduce:
> 1. Download the attached example PDF (rtf.pdf). It is the same as the
> "example rich text format" PDF from idealsoftware.com, but with text
> extraction protection disabled.
> 2a. java -jar pdfbox-app-1.6.0.jar ExtractText -sort rtf.pdf extr.txt
> Actual results:
> Text is all gibberish. On close inspection, sorting "reads" the text
> vertically: the first characters of each line come first, then the second
> characters of each line, and so on.
> Moreover, on JDK 7 the run fails with java.lang.IllegalArgumentException:
> Comparison method violates its general contract! (the offending comparator
> is the one that sorts text positions).
> Poking around the code indicates that the sorting would be correct *if* the
> character rotation were 270 degrees. Instead, it is calculated (correctly?)
> as zero.
> 2b. java -jar pdfbox-app-1.6.0.jar ExtractText rtf.pdf extr.txt
> Actual results:
> Text is fine, but each page is glued into a single line. Poking around the
> code indicates that character offsets move down the page correctly, but the
> expected line height is huge (full page height or width?), so the offsets
> never move down far enough to trigger newline detection.
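To illustrate the effect described in step 2b, here is a minimal sketch of threshold-based newline detection. It is not PDFBox's actual code: the class name, the method name, and the 0.5 factor are hypothetical and chosen only for illustration. It shows how an overestimated line height can suppress every line break even when the baseline offsets themselves are correct.

```java
class LineBreakSketch {
    // Hypothetical rule: start a new line when the baseline moves by more
    // than half of the expected line height. The factor is an assumption.
    static final double BREAK_FACTOR = 0.5;

    static String join(String[] glyphs, double[] baselineY, double lineHeight) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < glyphs.length; i++) {
            boolean movedEnough = i > 0
                    && Math.abs(baselineY[i] - baselineY[i - 1]) > lineHeight * BREAK_FACTOR;
            if (movedEnough) {
                out.append('\n');
            }
            out.append(glyphs[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] glyphs = {"a", "b", "c"};
        double[] y = {700.0, 700.0, 688.0};
        // Plausible line height: the 12-unit drop exceeds the 6-unit threshold.
        System.out.println(join(glyphs, y, 12.0));
        // Line height mistaken for the page height: no drop is ever large enough.
        System.out.println(join(glyphs, y, 842.0));
    }
}
```

With a line height of 12 the third glyph lands on a new line; with a line height of 842 all three glyphs stay on one line, which matches the "glued to a single line" symptom.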
> So something is very wrong with the character positions in these files,
> and it prevents PDFBox from extracting the text correctly.
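On the JDK 7 exception from step 2a: since JDK 7, object sorting uses TimSort, which may throw this IllegalArgumentException when it notices that a comparator is inconsistent. The sketch below is not PDFBox's comparator. It is a hypothetical tolerance-based comparator (the names and the tolerance value are assumptions) that shows one common way such a contract violation can arise: treating nearby y-coordinates as "the same line" is not transitive.

```java
import java.util.Comparator;

class ToleranceComparatorSketch {
    // Assumed tolerance: two positions count as the same line if their
    // y-coordinates differ by less than this amount.
    static final double TOLERANCE = 1.0;

    // Each position is {x, y}. Same line: order by x. Otherwise: order by y.
    static final Comparator<double[]> BY_LINE_THEN_X = (a, b) -> {
        if (Math.abs(a[1] - b[1]) < TOLERANCE) {
            return Double.compare(a[0], b[0]);
        }
        return Double.compare(a[1], b[1]);
    };

    // True when a < b and b < c hold, but a < c does not.
    static boolean breaksTransitivity(double[] a, double[] b, double[] c) {
        return BY_LINE_THEN_X.compare(a, b) < 0
                && BY_LINE_THEN_X.compare(b, c) < 0
                && BY_LINE_THEN_X.compare(a, c) >= 0;
    }

    public static void main(String[] args) {
        double[] a = {1.0, 1.2};
        double[] b = {2.0, 0.6};
        double[] c = {3.0, 0.0};
        // a and b share a line, b and c share a line, but a and c do not.
        System.out.println(breaksTransitivity(a, b, c));
    }
}
```

Here a sorts before b and b before c by x, yet a sorts after c by y, so no consistent ordering exists. Whether the comparator in 1.6.0 fails in this way or another is not established by this report.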