[
https://issues.apache.org/jira/browse/PDFBOX-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771802#comment-17771802
]
John Mayfield commented on PDFBOX-5350:
---------------------------------------
I should preface this by saying that I don't speak Korean, but I dumped the
text output (on trunk) with strict mode on and off for the three most changed
files (by DICE_COEF). As with my example patent files, the output with strict
mode on is missing many characters.
strict=true
{{ 미 남장 한 회에 해 1955 남신학 학 가 립 후 내 많 거듭하여 습}}
{{니다. 사들에 해 워진 우리 학 사명 가지고 7800여명 업생들 출하 , 회 }}
{{ 계 에 심 역할 잘 감당하는 훈 장 도 합니다.}}
strict=false (same section)
{{ 미국남장로교 한국선교회에 의해 1955년 호남신학대학교가 설립된 이후 내외적으로 많은 발전을 거듭하여 왔습}}
{{니다. 선교사들에 의해 세워진 우리 대학은 선교적 사명을 가지고 7800여명의 졸업생들을 배출하였으며, 교회 선교}}
{{와 세계 선교에 중심적인 역할을 잘 감당하는 영성 훈련의 장이기도 합니다.}}
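For readers who want to put a number on how far the two outputs above diverge, here is a minimal sketch of a Dice coefficient over character bigrams. It is an assumption that this resembles the DICE_COEF column in the regression reports; the comment does not say how that score was computed, and the class and method names here are purely illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox or the regression tooling.
class DiceSketch {

    // Count overlapping two-character windows in the text.
    static Map<String, Integer> bigrams(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            counts.merge(s.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    // Dice coefficient: 2 * |shared bigrams| / (|bigrams in a| + |bigrams in b|).
    // Returns 1.0 when neither input has any bigram (assumed convention).
    static double dice(String a, String b) {
        Map<String, Integer> x = bigrams(a);
        Map<String, Integer> y = bigrams(b);
        int total = 0;
        for (int c : x.values()) total += c;
        for (int c : y.values()) total += c;
        if (total == 0) {
            return 1.0;
        }
        int shared = 0;
        for (Map.Entry<String, Integer> e : x.entrySet()) {
            shared += Math.min(e.getValue(), y.getOrDefault(e.getKey(), 0));
        }
        return 2.0 * shared / total;
    }

    public static void main(String[] args) {
        // Pass the strict and non-strict text dumps as two arguments.
        if (args.length == 2) {
            System.out.println(dice(args[0], args[1]));
        }
    }
}
```

A score near 1.0 would mean the strict and non-strict dumps are nearly identical; dropped characters, as in the strict=true sample, pull the score down.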
> Regression unicode mapping in Korean document
> ---------------------------------------------
>
> Key: PDFBOX-5350
> URL: https://issues.apache.org/jira/browse/PDFBOX-5350
> Project: PDFBox
> Issue Type: Bug
> Components: FontBox
> Affects Versions: 2.0.16, 2.0.18, 2.0.20, 2.0.25, 3.0.0 PDFBox
> Reporter: John Mayfield
> Priority: Major
> Labels: regression
> Attachments:
> 7VL5DM7XOQC6AP4H6KL3XC7CYQ547FFQ.pdf_cmap_non-strict.txt,
> 7VL5DM7XOQC6AP4H6KL3XC7CYQ547FFQ.pdf_cmap_strict.txt,
> FQJR7LPGFZDCDX7A3FEBVQRHNHZ7XDQ2.pdf_cmap_non-strict.txt,
> FQJR7LPGFZDCDX7A3FEBVQRHNHZ7XDQ2.pdf_cmap_strict.txt,
> JX57O5E5YG6XM4FZABPULQGTW4OXPCWA.pdf_cmap_non-strict.txt,
> JX57O5E5YG6XM4FZABPULQGTW4OXPCWA.pdf_cmap_strict.txt, KR1019900015076.pdf,
> KR1019980000128.pdf, KR1019980000128_2_0_15.txt, KR1019980000128_2_0_25.txt,
> KR1020140140600.pdf, reports_pdfbox_2.0.29_vs_3.0.0.tar.xz
>
>
> The text output from Korean patent PDFs changed in v2.0.15+ (due to unicode
> mapping?). This was previously addressed in PDFBOX-4661, which resolved that
> example PDF in v2.0.18 - thanks. Unfortunately v2.0.18 causes other documents
> (included here) to have incorrect text output.
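> import java.io.File;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
> PDFTextStripper stripper = new PDFTextStripper();
> PDDocument doc = PDDocument.load(new File("KR1019980000128.pdf")); // 2.0.x API
> String text = stripper.getText(doc);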
> Like in PDFBOX-4661 there are numerous warnings of the form:
> WARNING: No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe
> I've attached the text dump of two versions, but in brief:
> 2.0.15: 공개번호 (public number)
> 2.0.25: 공개
> I only confirmed the issue in the versions listed above but presume it
> persists in all versions >= 2.0.18.
> My reading of PDFBOX-4661 is that there is something funky about these PDFs?
> PDFBox v2.0.15 had the correct text output. Testing in PDF.js incorrectly
> produces 공개뮈픸, so I can see there is something non-trivial here.
> Any help is much appreciated.
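To see which fonts account for the "No Unicode mapping" warnings mentioned in the description, one option is to tally them from a captured log. The sketch below assumes every warning follows the format of the single line quoted above; other PDFBox versions may word it differently, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
class UnmappedCidTally {

    // Modeled on: "No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe"
    private static final Pattern WARNING = Pattern.compile(
            "No Unicode mapping for CID\\+(\\d+) \\((\\d+)\\) in font (\\S+)");

    // Returns the number of matching warning lines per font name, sorted by name.
    static Map<String, Integer> countByFont(Iterable<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = WARNING.matcher(line);
            if (m.find()) {
                counts.merge(m.group(3), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Lines that do not match the pattern are ignored, so the full log can be passed in without filtering first.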