[jira] [Updated] (PDFBOX-3499) PDFBox 2.0.2 not parsing Japanese and Chinese Characters correctly from PDF

Tilman Hausherr (JIRA) Thu, 15 Sep 2016 13:50:44 -0700

     [ 
https://issues.apache.org/jira/browse/PDFBOX-3499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tilman Hausherr updated PDFBOX-3499:
------------------------------------
    Attachment: nihao2_unc.pdf

nihao2_unc.pdf is your file in uncompressed form, so that you can inspect it 
with a text editor. Of course, the best way to look at it is with PDFDebugger :-)

IMHO the producer of your PDF is to blame. Send them this issue and tell them 
their ToUnicode / CMap is wrong.
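
To illustrate how a wrong ToUnicode CMap could lead to the reported symptom, here is a minimal sketch in plain Java. It does not use PDFBox, and the two CMap fragments are hypothetical, not copied from nihao2.pdf; the class ToUnicodeSketch and its methods are invented for this example. It only understands bfchar entries (real CMaps also use bfrange and other operators) and assumes destinations in the Basic Multilingual Plane.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
class ToUnicodeSketch {

    // Collects the "<code> <unicode>" pairs found between beginbfchar and endbfchar.
    static Map<Integer, String> parseBfChars(String cmap) {
        Map<Integer, String> map = new LinkedHashMap<>();
        Pattern sectionPattern = Pattern.compile("beginbfchar(.*?)endbfchar", Pattern.DOTALL);
        Pattern pairPattern = Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>");
        Matcher section = sectionPattern.matcher(cmap);
        while (section.find()) {
            Matcher pair = pairPattern.matcher(section.group(1));
            while (pair.find()) {
                int code = Integer.parseInt(pair.group(1), 16);
                map.put(code, decodeUtf16Be(pair.group(2)));
            }
        }
        return map;
    }

    // Interprets the hex digits as UTF-16BE code units, four digits per unit.
    static String decodeUtf16Be(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            sb.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return sb.toString();
    }

    // Maps each character code of a text run through the table, as a text extractor would.
    static String extract(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical correct table: three codes, three different characters.
        String good = "3 beginbfchar\n<0001> <65E9>\n<0002> <4E0A>\n<0003> <597D>\nendbfchar";
        // Hypothetical faulty table: every code points at U+65E9.
        String broken = "3 beginbfchar\n<0001> <65E9>\n<0002> <65E9>\n<0003> <65E9>\nendbfchar";
        int[] codes = {1, 2, 3};
        System.out.println(extract(codes, parseBfChars(good)));
        System.out.println(extract(codes, parseBfChars(broken)));
    }
}
```

With the faulty table, the extractor still sees three codes but emits the same character three times, which matches the behaviour described in the issue. Whether the attached file is broken in exactly this way would have to be checked in the uncompressed PDF.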

> PDFBox 2.0.2 not parsing Japanese and Chinese Characters correctly from PDF
> ---------------------------------------------------------------------------
>
>                 Key: PDFBOX-3499
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3499
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.2
>            Reporter: Kaleb Akalework
>            Priority: Critical
>         Attachments: AppBody-Sample-Chinese.pdf, UnicodeTest.pdf, nihao2.pdf, 
> nihao2_unc.pdf
>
>
> I'm trying to use PDFBox 2.0.2 to parse PDF files that contain Japanese and 
> Chinese characters, but for some reason it does not parse them correctly. Every 
> character that is extracted is changed to the first character in the line. For 
> example, if the document contains 早上好, the extracted text correctly 
> contains 3 characters, but all 3 of them are 早早早: the last two 
> characters are replaced by the first one. The same string is parsed 
> correctly from a Word document. I was trying to use this with Tika-13, which 
> uses PDFBox 2.0.2. On the advice of Tim Allison (from Tika) I tried it with PDFBox 
> 2.0.3, and I still see the same problem. The following is the code I used.
> {code}
> import java.io.File;
> import java.io.IOException;
> import org.apache.pdfbox.cos.COSDocument;
> import org.apache.pdfbox.io.RandomAccessFile;
> import org.apache.pdfbox.pdfparser.PDFParser;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
> public class PDFBoxTesting
> {
>     private static PDFParser parser;
>     private static PDFTextStripper pdfStripper;
>     private static PDDocument pdDoc;
>     private static COSDocument cosDoc;
>     private static String Text;
>     private static String filePath;
>     private static File file;
>     public static String ToText() throws IOException
>     {
>         pdfStripper = null;
>         pdDoc = null;
>         cosDoc = null;
>         filePath = "C:\\Users\\kaleba\\Desktop\\nihao2.pdf";
>         file = new File(filePath);
>         // updated for PDFBox 2.0
>         parser = new PDFParser(new RandomAccessFile(file, "r"));
>         parser.parse();
>         cosDoc = parser.getDocument();
>         pdfStripper = new PDFTextStripper();
>         pdDoc = new PDDocument(cosDoc);
>         pdDoc.getNumberOfPages();
>         pdfStripper.setStartPage(1);
>         pdfStripper.setEndPage(10); // reading text from page 1 to 10 
>         // if you want to get text from full pdf file use this code
>         // pdfStripper.setEndPage(pdDoc.getNumberOfPages()); 
>         // put a breakpoint after executing getText
>         Text = pdfStripper.getText(pdDoc);
>         return Text;
>     }
>     public static void main(String[] args)
>     {
>         try
>         {
>             ToText();
>         }
>         catch (Exception e)
>         {
>             e.printStackTrace(); // report the failure instead of silently swallowing it
>         }
>     }
> }
> {code}
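
When checking a result like the one described above, it can help to print the code points of the extracted string instead of relying on how a console or debugger renders it. The following is a small sketch in plain Java; the class CodePointDump and its describe method are hypothetical names for this example, and the sample strings are the ones from the issue text rather than real extraction output.

```java
// Hypothetical helper, not part of PDFBox.
class CodePointDump {

    // Renders each code point of the text as U+XXXX, separated by spaces.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Expected text from the issue, written with escapes to avoid source encoding problems.
        System.out.println(describe("\u65E9\u4E0A\u597D")); // U+65E9 U+4E0A U+597D
        // Text as reported after extraction.
        System.out.println(describe("\u65E9\u65E9\u65E9")); // U+65E9 U+65E9 U+65E9
    }
}
```

Passing the string returned by PDFTextStripper.getText to such a helper would show whether the repeated character really is in the extracted data.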


