[ https://issues.apache.org/jira/browse/PDFBOX-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771384#comment-17771384 ]
Tilman Hausherr commented on PDFBOX-5350:
-----------------------------------------

Done. My first impression is that it looks good, although a few rows are suspicious (always when TOP_10_MORE_IN_A has more than TOP_10_MORE_IN_B in the content_diffs_no_exceptions.xls file).

[~jwmayfield] if you want to look at the files yourself, enter "https://corpora.tika.apache.org/base/docs/" in your browser and append the URL from the A column.

> Regression unicode mapping in Korean document
> ---------------------------------------------
>
>                 Key: PDFBOX-5350
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5350
>             Project: PDFBox
>          Issue Type: Bug
>          Components: FontBox
>    Affects Versions: 2.0.16, 2.0.18, 2.0.20, 2.0.25, 3.0.0 PDFBox
>            Reporter: John Mayfield
>            Priority: Major
>              Labels: regression
>         Attachments: KR1019900015076.pdf, KR1019980000128.pdf,
> KR1019980000128_2_0_15.txt, KR1019980000128_2_0_25.txt, KR1020140140600.pdf,
> reports_pdfbox_2.0.29_vs_3.0.0.tar.xz
>
>
> The text output from Korean Patent PDFs changed in v2.0.15+ (due to unicode
> mapping?), this was previously addressed in PDFBOX-4661 and resolved that
> example PDF in v2.0.18 - thanks. Unfortunately v2.0.18 causes other documents
> (included here) to now have incorrect text output.
> PDFTextStripper stripper = new PDFTextStripper();
> PDDocument doc = PDDocument.load(new File("KR1019980000128.pdf"));
> stripper.getText(doc);
> Like in PDFBOX-4661 there are numerous warnings of the form:
> WARNING: No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe
> I've attached the text dump of two versions, but in brief:
> 2.0.15: 공개번호 (public number)
> 2.0.25: 공개
> I only confirmed the issue in the versions listed above but presume the issue
> persists >=2.0.18.
> My reading of PDFBOX-4661 is there is something funky about these PDFs?
> PDFBOX v2.0.15 had the correct text output. Testing in PDF.js incorrectly
> produces 공개뮈픸 so I can see there is something non-trivial here.
> Any help is much appreciated.
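
For anyone opening several of the corpus files mentioned in the comment, the "base URL plus column A value" step can be sketched in plain Java. This is only an illustration: the class name `CorpusUrl`, the method `resolve`, and the sample path are hypothetical, and it assumes the column A value is a relative path that needs no further escaping (a value containing spaces, for instance, would have to be encoded first).

```java
import java.net.URI;

public class CorpusUrl {
    // Base URL quoted in the comment.
    static final String BASE = "https://corpora.tika.apache.org/base/docs/";

    // Appends a relative path (as copied from column A of the report) to the base URL.
    // Hypothetical helper, not part of PDFBox or Tika.
    public static URI resolve(String columnAPath) {
        // Drop a leading slash so that resolution keeps the /base/docs/ prefix
        // instead of replacing the whole path.
        String relative = columnAPath.startsWith("/")
                ? columnAPath.substring(1)
                : columnAPath;
        return URI.create(BASE).resolve(relative);
    }

    public static void main(String[] args) {
        // Made-up placeholder path, not an actual row from the report.
        System.out.println(resolve("some-corpus/00/000123.pdf"));
    }
}
```

The trailing slash on the base URL matters here: without it, `URI.resolve` would replace the last path segment (`docs`) rather than append to it.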