[ https://issues.apache.org/jira/browse/PDFBOX-561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
d ferbas updated PDFBOX-561:
----------------------------
Attachment: blindtext_mit_bullets.pdf
Sample file for the encoding problems.
> Text extraction with PDFTextStripper is system file.encoding dependent.
> Override does not work.
> -----------------------------------------------------------------------------------------------
>
> Key: PDFBOX-561
> URL: https://issues.apache.org/jira/browse/PDFBOX-561
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 0.7.3, 0.8.0-incubator
> Reporter: d ferbas
> Attachments: blindtext_mit_bullets.pdf
>
>
> The text extracted by PDFTextStripper depends on the JVM's file.encoding
> setting. The "override" new PDFTextStripper("utf-8"), available since
> version 0.8.0, has no effect.
> If a PDF file contains critical characters, the extracted string differs
> depending on the JVM's default encoding.
> It must be possible to set the encoding used for extraction, so that the
> results are the same regardless of the system default encoding.
> Sample file:
> http://e-nnovation.at/downloads/blindtext_mit_bullets_unsigned.pdf
> Bullets #3 to #8 differ between utf-8 and cp1252.
> Be aware that the file.encoding setting only takes effect if it is passed
> when the JVM is started (-Dfile.encoding=utf-8). Calling
> System.setProperty(..) at run time does not work.
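
The two observations above can be reproduced without PDFBox. The following is only a minimal sketch using the plain JDK; the class name EncodingDemo and its methods are hypothetical and are not part of PDFBox. It assumes that the bullets in the sample file are U+2022 and that the JDK ships the windows-1252 (cp1252) charset, which common JDKs do. It shows that the same character becomes different bytes under utf-8 and cp1252, and that changing file.encoding at run time leaves the JVM default charset untouched.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of PDFBox.
public class EncodingDemo {

    // Encode text through a Writer bound to an explicitly named charset,
    // so the result never depends on the JVM default (file.encoding).
    public static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Returns true if setting file.encoding at run time did NOT change
    // the charset reported by Charset.defaultCharset().
    public static boolean defaultCharsetIgnoresSetProperty() {
        Charset before = Charset.defaultCharset();
        String saved = System.getProperty("file.encoding");
        String other = before.equals(StandardCharsets.UTF_8) ? "windows-1252" : "UTF-8";
        try {
            System.setProperty("file.encoding", other);
            return before.equals(Charset.defaultCharset());
        } finally {
            // Restore the property so the demo has no lasting side effect.
            if (saved != null) {
                System.setProperty("file.encoding", saved);
            } else {
                System.clearProperty("file.encoding");
            }
        }
    }

    // Render bytes as space-separated hex for display.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String bullet = "\u2022"; // assumption: the bullet used in the sample file
        System.out.println("utf-8:  " + hex(encode(bullet, StandardCharsets.UTF_8)));
        System.out.println("cp1252: " + hex(encode(bullet, Charset.forName("windows-1252"))));
        System.out.println("default charset unchanged by setProperty: "
                + defaultCharsetIgnoresSetProperty());
    }
}
```

Binding the charset to the Writer is the design point here: any code path that falls back to the JVM default instead of an explicitly supplied charset will produce output that varies with -Dfile.encoding, which matches the behaviour described in this issue.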