shinpei wada created PDFBOX-2747:
------------------------------------
Summary: pdfbox: garbled japanese txt output
Key: PDFBOX-2747
URL: https://issues.apache.org/jira/browse/PDFBOX-2747
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.6
Environment: mac osx 10.9.5
Reporter: shinpei wada
I am trying to convert this PDF to plain text:
http://www.kabupro.jp/edp/20130829/S000EDM7.pdf
The original PDF contains the following text:
【提出書類】有価証券報告書
【根拠条文】金融商品取引法第24条第1項
【提出先】近畿財務局長
【提出日】平成22年6月28日
【事業年度】第27期(自 平成21年4月1日 至 平成22年3月31日)
【会社名】株式会社カネミツ
【英訳名】KANEMITSU CORPORATION
However, when I convert it to text, I get garbled output like this:
?????? ???????
?????? ????????24????
????? ??????
????? ??22???28?
?????? ?27??????21??????????22???31??
????? ????????
????? KANEMITSU CORPORATION
Interestingly, PDFBox extracts the half-width alphanumeric characters
("24", "22", "KANEMITSU") correctly. When I try iText instead, the output
contains the Japanese characters but not the alphanumeric characters that
appear here.
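One pattern that produces exactly this kind of output (ASCII survives, every Japanese character becomes "?") is Java writing text through a charset that cannot represent Japanese. The sketch below only illustrates that mechanism with the plain JDK. It does not call PDFBox, the names EncodingDemo and roundTrip are made up for this example, and whether this is the actual cause of the problem reported here is an assumption that has not been verified.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only (not part of PDFBox).
class EncodingDemo {

    // Encode the text to bytes in the given charset, then decode it again.
    // String.getBytes(Charset) silently substitutes the charset's replacement
    // byte ('?' for US-ASCII) for every character it cannot represent.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The company-name label from the sample text followed by
        // "KANEMITSU CORPORATION", written with Unicode escapes so the
        // source file itself stays pure ASCII.
        String line = "\u3010\u4F1A\u793E\u540D\u3011KANEMITSU CORPORATION";

        // Lossy: the five Japanese characters each become '?'.
        System.out.println(roundTrip(line, StandardCharsets.US_ASCII));

        // Lossless: UTF-8 can represent every character in the string.
        System.out.println(roundTrip(line, StandardCharsets.UTF_8).equals(line));
    }
}
```

If the output encoding does turn out to be the cause, forcing a Unicode encoding such as UTF-8 when writing the extracted text would be the first thing to try.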
The closest existing issue I could find is:
https://issues.apache.org/jira/browse/PDFBOX-1895
I had similar problems with other PDFs under PDFBox 1.8.5, but those were
resolved in 1.8.6 by that bug fix.
The issue reported here seems to be unrelated, though.