[
https://issues.apache.org/jira/browse/PDFBOX-3499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated PDFBOX-3499:
------------------------------------
Attachment: nihao2_unc.pdf
nihao2_unc.pdf is an uncompressed copy of your file, so that you can look at
it with a text editor. Of course the best way is to look at it with
PDFDebugger :-)
IMHO the producer of your PDF is to blame. Send them this issue and tell them
that their ToUnicode / CMap is wrong.
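To illustrate why a wrong ToUnicode CMap would give exactly this symptom, here is a minimal sketch in plain Java (no PDFBox calls). It assumes text extraction boils down to looking up each character code in a code-to-Unicode table. The class name, the codes 1 to 3 and both tables are invented for illustration; they are not taken from nihao2.pdf, whose actual CMap contents are not shown in this message.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: the code-to-Unicode tables below are invented
// and are NOT the contents of the CMap in nihao2.pdf.
class ToUnicodeSketch {

    // Decode a sequence of character codes through a code -> Unicode table,
    // which is roughly the role a ToUnicode CMap plays during text extraction.
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            // Codes without an entry are emitted as U+FFFD (replacement character).
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Three distinct glyph codes, as they might appear in a content stream.
        int[] codes = {1, 2, 3};

        // A table in which every code has its own Unicode value.
        Map<Integer, String> correct = new LinkedHashMap<>();
        correct.put(1, "\u65E9"); // U+65E9
        correct.put(2, "\u4E0A"); // U+4E0A
        correct.put(3, "\u597D"); // U+597D

        // A faulty table in which every code points at the same Unicode value.
        Map<Integer, String> broken = new LinkedHashMap<>();
        broken.put(1, "\u65E9");
        broken.put(2, "\u65E9");
        broken.put(3, "\u65E9");

        System.out.println(decode(codes, correct));
        System.out.println(decode(codes, broken));
    }
}
```

With the faulty table the extractor still sees three codes, so the character count is right, but all three decode to the same character. A PDF viewer would still display the page correctly, since rendering uses the glyphs and not the ToUnicode table.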
> PDFBox 2.0.2 not parsing Japanese and Chinese Characters correctly from PDF
> ---------------------------------------------------------------------------
>
> Key: PDFBOX-3499
> URL: https://issues.apache.org/jira/browse/PDFBOX-3499
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.2
> Reporter: Kaleb Akalework
> Priority: Critical
> Attachments: AppBody-Sample-Chinese.pdf, UnicodeTest.pdf, nihao2.pdf,
> nihao2_unc.pdf
>
>
> I'm trying to use PDFBox 2.0.2 to parse PDF files that contain Japanese and
> Chinese characters, but for some reason it does not parse them correctly.
> Every character that is extracted is changed to the first character in the
> line. For example, if the document contains 早上好, the extracted text
> correctly has 3 characters, but all 3 characters are 早早早: the last two
> characters are replaced by the first one. The same string is parsed correctly
> in a Word document. I was trying to use this with Tika-13, which uses PDFBox
> 2.0.2. On the advice of Tim Allison (from Tika) I tried it with PDFBox 2.0.3,
> and I still see the same problem. The following is the code I used.
> {code}
> import java.io.File;
> import java.io.IOException;
> import org.apache.pdfbox.cos.COSDocument;
> import org.apache.pdfbox.io.RandomAccessFile;
> import org.apache.pdfbox.pdfparser.PDFParser;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
>
> public class PDFBoxTesting
> {
>     public static String toText() throws IOException
>     {
>         String filePath = "C:\\Users\\kaleba\\Desktop\\nihao2.pdf";
>         File file = new File(filePath);
>         // updated for PDFBox 2.0: the parser reads from a RandomAccessFile
>         PDFParser parser = new PDFParser(new RandomAccessFile(file, "r"));
>         parser.parse();
>         COSDocument cosDoc = parser.getDocument();
>         // try-with-resources closes the document when extraction is done
>         try (PDDocument pdDoc = new PDDocument(cosDoc))
>         {
>             PDFTextStripper pdfStripper = new PDFTextStripper();
>             pdfStripper.setStartPage(1);
>             pdfStripper.setEndPage(10); // reading text from page 1 to 10
>             // if you want to get text from the full PDF file use this code
>             // pdfStripper.setEndPage(pdDoc.getNumberOfPages());
>             String text = pdfStripper.getText(pdDoc);
>             return text; // put a breakpoint here, after getText has run
>         }
>     }
>
>     public static void main(String[] args)
>     {
>         try
>         {
>             toText();
>         }
>         catch (IOException e)
>         {
>             // report the failure instead of silently swallowing it
>             e.printStackTrace();
>         }
>     }
> }
> {code}