[jira] [Created] (PDFBOX-5350) Regression unicode mapping in Korean document

John Mayfield (Jira) Tue, 21 Dec 2021 13:52:20 -0800

John Mayfield created PDFBOX-5350:
-------------------------------------

             Summary: Regression unicode mapping in Korean document
                 Key: PDFBOX-5350
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5350
             Project: PDFBox
          Issue Type: Bug
          Components: FontBox
    Affects Versions: 2.0.25, 2.0.20, 2.0.18, 2.0.16
            Reporter: John Mayfield
         Attachments: KR1019900015076.pdf, KR1019980000128.pdf, 
KR1019980000128_2_0_15.txt, KR1019980000128_2_0_25.txt, KR1020140140600.pdf


The text output from Korean patent PDFs changed after v2.0.15 (due to Unicode 
mapping?). This was previously reported in PDFBOX-4661, and the fix in v2.0.18 
resolved that example PDF - thanks. Unfortunately, v2.0.18 causes other 
documents (attached here) to produce incorrect text output.

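// Needs java.io.File, org.apache.pdfbox.pdmodel.PDDocument, org.apache.pdfbox.text.PDFTextStripper
PDFTextStripper stripper = new PDFTextStripper();
try (PDDocument doc = PDDocument.load(new File("KR1019980000128.pdf"))) {
    String text = stripper.getText(doc); // extracted text, as in the attached dumps
}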

As in PDFBOX-4661, there are numerous warnings of the form:

WARNING: No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe
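
To get a feel for how many glyphs are affected, the warnings can be tallied per font. The following is only a rough sketch: the class and method names are hypothetical (not part of PDFBox), and it assumes the log lines have exactly the format shown above.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: counts "No Unicode mapping" warnings per font.
class UnmappedCidTally {
    // Assumed line format, taken from the warning quoted in this report:
    // WARNING: No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe
    private static final Pattern WARNING =
            Pattern.compile("No Unicode mapping for CID\\+(\\d+) \\((\\d+)\\) in font (\\S+)");

    // Returns font name -> number of matching warning lines in the given log text.
    static Map<String, Integer> countByFont(String log) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : log.split("\\R")) {
            Matcher m = WARNING.matcher(line);
            if (m.find()) {
                counts.merge(m.group(3), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The same warning repeated twice, purely for illustration.
        String log = "WARNING: No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe\n"
                   + "WARNING: No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe\n";
        System.out.println(countByFont(log));
    }
}
```

Capturing the console output of each PDFBox version and running it through such a tally might show whether the set of affected fonts differs between releases.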

I've attached text dumps from two versions, but in brief:

2.0.15: 공개번호 (public number)

2.0.25: 공개 

I have only confirmed the issue in the versions listed above, but presume it 
persists in all versions >= 2.0.18.

My reading of PDFBOX-4661 is that there is something unusual about these PDFs. 
PDFBox v2.0.15 produced the correct text output, while PDF.js incorrectly 
produces 공개뮈픸, so I can see there is something non-trivial here.
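
When comparing outputs like these, it may help to look at the code points rather than the rendered glyphs. Below is a minimal sketch under that assumption; the class and its methods are hypothetical helpers (not part of PDFBox or PDF.js), and the sample strings are the ones quoted in this report.

```java
// Hypothetical helper for comparing extracted strings; not part of PDFBox.
class CodePointDump {
    // Formats each code point of s as U+XXXX, separated by single spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // Returns the number of leading code points that a and b have in common.
    static int commonPrefix(String a, String b) {
        int[] x = a.codePoints().toArray();
        int[] y = b.codePoints().toArray();
        int i = 0;
        while (i < x.length && i < y.length && x[i] == y[i]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        String expected = "공개번호"; // v2.0.15 output quoted above
        String pdfJs = "공개뮈픸";    // PDF.js output quoted above
        System.out.println(dump(expected));
        System.out.println(dump(pdfJs));
        System.out.println("shared prefix length: " + commonPrefix(expected, pdfJs));
    }
}
```

The source file needs to be compiled as UTF-8 for the Korean literals in `main` to survive intact.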

Any help is much appreciated.



