[jira] [Closed] (PDFBOX-561) Text extraction with PDFTextStripper is system file.encoding dependent. Override does not work.

JIRA Mon, 13 Oct 2014 10:27:02 -0700

     [ https://issues.apache.org/jira/browse/PDFBOX-561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Andreas Lehmkühler closed PDFBOX-561.
-------------------------------------
    Resolution: Fixed
      Assignee: Andreas Lehmkühler

This was solved some time ago.

> Text extraction with PDFTextStripper is system file.encoding dependent. 
> Override does not work.
> -----------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-561
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-561
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.7.3, 0.8.0-incubator
>            Reporter: d ferbas
>            Assignee: Andreas Lehmkühler
>         Attachments: blindtext_mit_bullets.pdf
>
>
> The text extraction depends on the JVM file.encoding setting. The "override" 
> new PDFTextStripper("utf-8") (available since version 0.8.0) has no effect.
> If a PDF file contains critical characters (such as the bullets in the sample 
> file), the extracted string differs depending on the JVM system encoding. 
> It must be possible to set the encoding for the extraction so that results are 
> the same regardless of the default system encoding.
> Sample file: see attachment "blindtext_mit_bullets.pdf"
> Bullets #3 to #8 differ between utf-8 and cp1252.
> Be aware that the file.encoding setting only takes effect if passed when 
> starting the JVM (-Dfile.encoding=utf-8). System.setProperty(..) does not work.
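
The two effects described in the report can be reproduced without PDFBox. The following is a minimal sketch using only the Java standard library: it shows that a bullet character (U+2022) yields different bytes under utf-8 and cp1252, and that changing file.encoding with System.setProperty at runtime leaves the already-initialized default charset unchanged. The class and method names (EncodingDemo, encode, defaultCharsetIgnoresRuntimeOverride) are hypothetical and chosen for illustration only; this does not reflect how PDFTextStripper handles encodings internally.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical demo class; not part of PDFBox.
class EncodingDemo {

    // Encode text with an explicitly named charset instead of relying
    // on the JVM default (which is derived from file.encoding at startup).
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    // Returns true if setting file.encoding at runtime leaves the
    // default charset as it was. The original property value is restored.
    static boolean defaultCharsetIgnoresRuntimeOverride(String otherEncoding) {
        Charset before = Charset.defaultCharset();
        String saved = System.getProperty("file.encoding");
        try {
            System.setProperty("file.encoding", otherEncoding);
            return before.equals(Charset.defaultCharset());
        } finally {
            if (saved != null) {
                System.setProperty("file.encoding", saved);
            } else {
                System.clearProperty("file.encoding");
            }
        }
    }

    public static void main(String[] args) {
        String bullet = "\u2022";
        // The same character is three bytes in UTF-8 but one byte in cp1252.
        System.out.println(Arrays.toString(encode(bullet, "UTF-8")));
        System.out.println(Arrays.toString(encode(bullet, "windows-1252")));
        // The default charset is fixed once initialized, so a runtime
        // override of the property is not picked up.
        System.out.println(defaultCharsetIgnoresRuntimeOverride("UTF-16"));
    }
}
```

Naming the charset explicitly at every conversion, as encode does here, is one way to get the same result regardless of how the JVM was started.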



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
