shinpei wada created PDFBOX-2747:
------------------------------------

             Summary: pdfbox: garbled japanese txt output
                 Key: PDFBOX-2747
                 URL: https://issues.apache.org/jira/browse/PDFBOX-2747
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.6
         Environment: mac osx 10.9.5
            Reporter: shinpei wada


I am trying to convert this PDF to text:
http://www.kabupro.jp/edp/20130829/S000EDM7.pdf

The original PDF contains the following text:
【提出書類】有価証券報告書
【根拠条文】金融商品取引法第24条第1項
【提出先】近畿財務局長
【提出日】平成22年6月28日
【事業年度】第27期(自 平成21年4月1日 至 平成22年3月31日)
【会社名】株式会社カネミツ
【英訳名】KANEMITSU CORPORATION

But when I convert it to text, I get garbled output like this:
    ?????? ???????
    ?????? ????????24????
    ????? ??????
    ????? ??22???28?
    ?????? ?27??????21??????????22???31??
    ????? ????????
    ????? KANEMITSU CORPORATION

Interestingly, PDFBox returns the half-width alphanumeric characters 
("24", "22", "KANEMITSU") but not the Japanese ones, whereas when I try 
iText the output contains the Japanese characters but not the alphanumeric 
characters that appear here.
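
For comparison, the pattern in the PDFBox output (Japanese characters replaced 
by "?", ASCII left intact) is also what plain Java produces when text is 
encoded with a charset that cannot represent it. The following is a minimal 
sketch using only the JDK, not PDFBox; whether an output-encoding problem is 
the cause for this file is an assumption and has not been confirmed. The class 
name EncodingLossDemo and the helper roundTrip are hypothetical names chosen 
for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingLossDemo {

    // Encode the text with the given charset and decode it again.
    // String.getBytes(Charset) replaces every unmappable character with
    // the charset's replacement byte, which is '?' for US-ASCII.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // "平成22年" written with Unicode escapes so the source file's own
        // encoding cannot affect the result.
        String sample = "\u5E73\u621022\u5E74";

        // The three Japanese characters are lost, the digits survive.
        System.out.println(roundTrip(sample, StandardCharsets.US_ASCII)); // prints ??22?

        // UTF-8 can represent every character, so nothing is lost.
        System.out.println(roundTrip(sample, StandardCharsets.UTF_8).equals(sample)); // prints true
    }
}
```

If the extracted text looks correct when inspected as a Java String but is 
garbled only once written to a file or console, the output charset would be 
the first thing to check.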

The closest issue I could find is this one:
https://issues.apache.org/jira/browse/PDFBOX-1895

I did have similar issues with other PDFs when using PDFBox version 1.8.5, 
although they were resolved in 1.8.6 following that bug fix. 

The issue presented here seems unrelated, though.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
