[ 
https://issues.apache.org/jira/browse/PDFBOX-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17770917#comment-17770917
 ] 

John Mayfield commented on PDFBOX-5350:
---------------------------------------

I believe the PDF attached to this issue has an embedded CMap? I don't mind if 
strictMode is the default, but it would be handy to have an option to turn it 
off. These are Korean patent documents (public domain), and it is useful to be 
able to extract text from them. Currently I am stuck on PDFBox v2.0.14.

The other thing I was contemplating as a workaround was pre-parsing the bytes 
of the PDF and splitting the byte range into smaller chunks.
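
A rough sketch of what that chunking could look like, under some loud 
assumptions: that the troublesome ranges are bfrange entries in the embedded 
ToUnicode CMap, that source codes are 2 bytes wide, and that each destination 
is a single 2-byte UTF-16BE value. The class and method names 
(BfRangeSplitter, split, format) are hypothetical and not part of PDFBox, and 
the sample values are made up rather than taken from the attached PDFs. Each 
emitted chunk keeps one source high byte and never overflows the low byte of 
the destination, so a parser that only increments the last byte should still 
map every code.

```java
import java.util.ArrayList;
import java.util.List;

public class BfRangeSplitter {

    /**
     * Splits the mapping [lo, hi] -> dst, dst+1, ... into sub-ranges.
     * Each element is {srcStart, srcEnd, dstStart}. A chunk ends where the
     * source low byte or the destination low byte would pass 0xFF.
     */
    static List<int[]> split(int lo, int hi, int dst) {
        if (lo < 0 || hi > 0xFFFF || lo > hi || dst < 0
                || dst + (hi - lo) > 0xFFFF) {
            throw new IllegalArgumentException("expected 2-byte codes, lo <= hi");
        }
        List<int[]> chunks = new ArrayList<>();
        int start = lo;
        while (start <= hi) {
            int dstStart = dst + (start - lo);
            // Last source code that still shares the high byte of 'start'.
            int srcLimit = start | 0xFF;
            // Last source code before the destination low byte overflows.
            int dstLimit = start + (0xFF - (dstStart & 0xFF));
            int end = Math.min(hi, Math.min(srcLimit, dstLimit));
            chunks.add(new int[] {start, end, dstStart});
            start = end + 1;
        }
        return chunks;
    }

    /** Formats one chunk in bfrange syntax: <srcStart> <srcEnd> <dstStart>. */
    static String format(int[] chunk) {
        return String.format("<%04X> <%04X> <%04X>", chunk[0], chunk[1], chunk[2]);
    }

    public static void main(String[] args) {
        // Illustrative values only.
        for (int[] chunk : split(0x37F0, 0x3905, 0xAC00)) {
            System.out.println(format(chunk));
        }
    }
}
```

Rewriting the CMap stream inside the PDF with the split entries is a separate 
step that is not shown here.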

Any other suggestions/workarounds are welcome.

> Regression unicode mapping in Korean document
> ---------------------------------------------
>
>                 Key: PDFBOX-5350
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5350
>             Project: PDFBox
>          Issue Type: Bug
>          Components: FontBox
>    Affects Versions: 2.0.16, 2.0.18, 2.0.20, 2.0.25, 3.0.0 PDFBox
>            Reporter: John Mayfield
>            Priority: Major
>              Labels: regression
>         Attachments: KR1019900015076.pdf, KR1019980000128.pdf, 
> KR1019980000128_2_0_15.txt, KR1019980000128_2_0_25.txt, KR1020140140600.pdf
>
>
> The text output from Korean Patent PDFs changed in v2.0.15+ (due to unicode 
> mapping?), this was previously addressed in PDFBOX-4661 and resolved that 
> example PDF in v2.0.18 - thanks. Unfortunately v2.0.18 causes other documents 
> (included here) to now have incorrect text output.
> PDFTextStripper stripper = new PDFTextStripper();
> PDDocument doc = PDDocument.load(new File("KR1019980000128.pdf"));
> stripper.getText(doc);
> Like in PDFBOX-4661 there are numerous warnings of the form:
> WARNING: No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe
> I've attached the text dump of two versions, but in brief:
> 2.0.15: 공개번호 (public number)
> 2.0.25: 공개 
> I only confirmed the issue in the versions listed above but presume the issue 
> persists >=2.0.18.
> My reading of PDFBOX-4661 is there is something funky about these PDFs? 
> PDFBox v2.0.15 had the correct text output. Testing in PDF.js incorrectly 
> produces 공개뮈픸 so I can see there is something non-trivial here.
> Any help is much appreciated.



