Russian extraction encoding failure
-----------------------------------

                 Key: PDFBOX-398
                 URL: https://issues.apache.org/jira/browse/PDFBOX-398
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 0.7.3, 0.8.0-incubator
         Environment: Windows XP 32-bit, CentOS 5.2 32-bit
            Reporter: Adrian Romano

I am extracting text from Russian documents with PDFTextStripper, and some of them do not extract correctly.

When I extract on Windows and specify UTF-8 as the encoding, the output is garbage. When I extract on Linux, the output is garbage whichever encoding I specify. The only way I can get usable output is to extract the PDF on Windows without specifying an encoding. In that case the output is correct when viewed in UltraEdit, but not in Notepad. I can view it in Notepad only after converting the file to UTF-8 with iconv.

It appears to me that the encoding is not being read correctly from the PDF, and that when the text is written out as UTF-8 it is being double encoded or something similar. As a workaround I can detect this double encoding, rerun the extraction with no encoding specified, and convert the result to UTF-8 with iconv, which gives correct output. This workaround does not help on Linux, because there I cannot get the file to extract correctly with any encoding.
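
For reference, here is a minimal sketch of what "double encoded" could mean and how it might be detected and undone. It does not use PDFBox and is not a diagnosis of the actual bug. The class name `DoubleEncodingSketch` and both method names are made up for illustration, and ISO-8859-1 is only a stand-in for whatever single-byte charset might really be involved.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names come from PDFBox.
public class DoubleEncodingSketch {

    // Simulates the suspected fault: UTF-8 bytes are misread as a
    // single-byte charset (assumed here: ISO-8859-1) and encoded as UTF-8 again.
    public static byte[] doubleEncode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    // Tries to peel off one extra UTF-8 layer.
    // Returns the repaired text, or null if the data does not look double encoded.
    public static String undoDoubleEncoding(byte[] data) {
        String once = new String(data, StandardCharsets.UTF_8);

        // After one decode, double-encoded text contains only chars <= 0xFF,
        // because each char stands for one byte of the inner UTF-8 stream.
        for (int i = 0; i < once.length(); i++) {
            if (once.charAt(i) > 0xFF) {
                return null;
            }
        }

        // Map those chars back to the inner bytes (ISO-8859-1 is 1:1 for 0x00-0xFF).
        byte[] inner = once.getBytes(StandardCharsets.ISO_8859_1);

        // Decode strictly so that invalid UTF-8 is reported instead of replaced.
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strict.decode(ByteBuffer.wrap(inner)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // A short Russian word, written with Unicode escapes so the
        // source file encoding does not matter.
        String original = "\u041f\u0440\u0438\u0432\u0435\u0442";
        byte[] garbled = doubleEncode(original);
        String repaired = undoDoubleEncoding(garbled);
        System.out.println("double-encoded length: " + garbled.length + " bytes");
        System.out.println("repaired matches original: " + original.equals(repaired));
    }
}
```

The check is deliberately conservative: plain ASCII text passes through unchanged, and correctly encoded Russian text is rejected at the first step because Cyrillic characters lie above 0xFF. If the intermediate charset were something like windows-1251 rather than ISO-8859-1, the mapping back to bytes would differ and might not be fully reversible.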