[
https://issues.apache.org/jira/browse/PDFBOX-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr closed PDFBOX-2747.
-----------------------------------
Resolution: Implemented
Fix Version/s: 2.0.0
I tried with 1.8.9 (the current version), where it doesn't work either. However,
it works with the unreleased version 2.0, which you can get from svn:
pdfbox.apache.org/downloads.html#scm
> pdfbox: garbled japanese txt output
> -----------------------------------
>
> Key: PDFBOX-2747
> URL: https://issues.apache.org/jira/browse/PDFBOX-2747
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.6
> Environment: mac osx 10.9.5
> Reporter: shinpei wada
> Fix For: 2.0.0
>
>
> I am trying to convert this PDF to text:
> http://www.kabupro.jp/edp/20130829/S000EDM7.pdf
> The original PDF contains the following text:
> 【提出書類】有価証券報告書
> 【根拠条文】金融商品取引法第24条第1項
> 【提出先】近畿財務局長
> 【提出日】平成22年6月28日
> 【事業年度】第27期(自 平成21年4月1日 至 平成22年3月31日)
> 【会社名】株式会社カネミツ
> 【英訳名】KANEMITSU CORPORATION
> But when I convert it to text, I get garbled output, as shown below:
> ?????? ???????
> ?????? ????????24????
> ????? ??????
> ????? ??22???28?
> ?????? ?27??????21??????????22???31??
> ????? ????????
> ????? KANEMITSU CORPORATION
> What is interesting is that pdfbox returns the half-width alphanumeric
> characters ("24", "22", "KANEMITSU"), but when I try iText, the output
> contains the Japanese characters but not the alphanumeric characters that
> appear here.
> The closest issue I could find is this one:
> https://issues.apache.org/jira/browse/PDFBOX-1895
> I had similar issues with other PDFs when using pdfbox version 1.8.5,
> although these were resolved in 1.8.6 by that bug fix.
> The issue presented here seems unrelated, though.
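
For anyone comparing outputs: the pattern quoted above (ASCII characters intact, every Japanese character shown as "?") is also what Java produces when a string is encoded with a charset that cannot represent those characters. The sketch below only reproduces that symptom with the standard library. It is not a diagnosis of this issue, and since the behaviour differs between 1.8.9 and 2.0 the cause here may well lie elsewhere. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    // Encode the text with the given charset and decode it again.
    // String.getBytes(Charset) substitutes the charset's replacement
    // byte for every character it cannot map; for US-ASCII that is '?'.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // A Japanese label in brackets followed by ASCII text, written as
        // Unicode escapes so the source file's own encoding does not matter.
        String line = "\u3010\u82F1\u8A33\u540D\u3011KANEMITSU CORPORATION";

        // The five non-ASCII characters become '?', the ASCII part survives.
        System.out.println(roundTrip(line, StandardCharsets.US_ASCII));

        // UTF-8 can represent every character, so the string is unchanged.
        System.out.println(roundTrip(line, StandardCharsets.UTF_8).equals(line));
    }
}
```

If a round trip through UTF-8 preserves text that still comes out as "?" from the extraction itself, the output encoding is probably not the cause.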