[ https://issues.apache.org/jira/browse/PDFBOX-398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrian Romano updated PDFBOX-398:
---------------------------------

    Attachment: garbage output.jpg

Added a screenshot of the garbage output as viewed in UltraEdit. This file was extracted with UTF-8 encoding.

> Russian extraction encoding failure
> -----------------------------------
>
>                 Key: PDFBOX-398
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-398
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.7.3, 0.8.0-incubator
>         Environment: Windows XP 32-bit, CentOS 5.2 32-bit
>            Reporter: Adrian Romano
>         Attachments: 7.pdf, garbage output.jpg, working output.jpg
>
>
> I am extracting text from Russian documents with PDFTextStripper, and some
> of them do not extract correctly.
> When I extract on Windows using UTF-8 encoding, the output is garbage.
> When I extract on Linux using any encoding, the output is garbage.
> The only way I can get viable output is to extract the PDF on Windows
> without specifying an encoding. In that case the output is correct when
> viewed in UltraEdit, but not in Notepad. I can view the output in Notepad
> only after I convert the file to UTF-8 with iconv.
> It appears to me that the encoding isn't being read correctly from the PDF,
> and when the text is written out as UTF-8, it is being double encoded or
> something similar. I can detect this double encoding, re-run the extraction
> with no encoding specified, then convert the result to UTF-8 using iconv,
> and it is OK.
> But this method does not work on Linux, as I cannot get the file to extract
> correctly using any encoding on Linux.
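
To make the "double encoded" suspicion above concrete, here is a minimal Java sketch of how such garbage can arise and how it might be detected. It is not PDFBox code and does not claim to be the actual cause of this issue. The class and method names are hypothetical, and the use of ISO-8859-1 as the wrongly assumed intermediate charset is an assumption chosen because it round-trips every byte value.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; nothing here is part of PDFBox.
public class DoubleEncodingDemo {

    // Simulates the suspected failure: correct UTF-8 bytes are misread as
    // ISO-8859-1 (assumed charset), so each byte becomes its own character.
    // Writing the result out as UTF-8 then encodes the text a second time.
    public static String doubleEncode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the misreading: turn each character back into one byte,
    // then decode those bytes as the UTF-8 they originally were.
    public static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    // Heuristic check: every character fits in one byte, at least one is
    // non-ASCII, and the bytes form strictly valid UTF-8.
    public static boolean looksDoubleEncoded(String s) {
        boolean hasHighChar = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0xFF) {
                return false; // real Cyrillic (or other wide) characters
            }
            if (c >= 0x80) {
                hasHighChar = true;
            }
        }
        if (!hasHighChar) {
            return false; // pure ASCII is unaffected by double encoding
        }
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(s.getBytes(StandardCharsets.ISO_8859_1)));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The Russian word "Privet" (hello), written with escapes so the
        // source file's own encoding cannot affect the demo.
        String russian = "\u041F\u0440\u0438\u0432\u0435\u0442";
        String garbled = doubleEncode(russian);

        System.out.println(garbled.length());                  // 12
        System.out.println(looksDoubleEncoded(garbled));       // true
        System.out.println(repair(garbled).equals(russian));   // true
    }
}
```

If the extracted files match this pattern, `repair` would recover the text without the iconv step. If they do not, the problem more likely lies in how the character codes are mapped from the PDF's fonts, which this sketch does not address.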