[ 
https://issues.apache.org/jira/browse/PDFBOX-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771472#comment-17771472
 ] 

Tilman Hausherr commented on PDFBOX-5350:
-----------------------------------------

I assume the results would be similar, but I didn't test with 3.0 because Apache 
Tika (which was used for this test) doesn't use it yet (although it's being 
prepared).

Only 3 lines fell under "TOP_10_MORE_IN_A has more", e.g. V195. That one turned 
out to be unrelated to this issue. Its text extraction looks incorrect for some 
bullet points, but it matches Adobe Reader's output, so I consider it correct. 
The other cases I looked at were harmless. I'll leave the Korean ones for you.

> Regression unicode mapping in Korean document
> ---------------------------------------------
>
>                 Key: PDFBOX-5350
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5350
>             Project: PDFBox
>          Issue Type: Bug
>          Components: FontBox
>    Affects Versions: 2.0.16, 2.0.18, 2.0.20, 2.0.25, 3.0.0 PDFBox
>            Reporter: John Mayfield
>            Priority: Major
>              Labels: regression
>         Attachments: KR1019900015076.pdf, KR1019980000128.pdf, 
> KR1019980000128_2_0_15.txt, KR1019980000128_2_0_25.txt, KR1020140140600.pdf, 
> reports_pdfbox_2.0.29_vs_3.0.0.tar.xz
>
>
> The text output from Korean Patent PDFs changed in v2.0.15+ (due to Unicode 
> mapping?). This was previously addressed in PDFBOX-4661, which resolved that 
> example PDF in v2.0.18 - thanks. Unfortunately, v2.0.18 causes other documents 
> (included here) to now have incorrect text output.
>
>     PDFTextStripper stripper = new PDFTextStripper();
>     PDDocument doc = PDDocument.load(new File("KR1019980000128.pdf"));
>     stripper.getText(doc);
>
> Like in PDFBOX-4661 there are numerous warnings of the form:
>
>     WARNING: No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe
>
> I've attached the text dump of two versions, but in brief:
>
>     2.0.15: 공개번호 (public number)
>     2.0.25: 공개
>
> I only confirmed the issue in the versions listed above but presume the issue 
> persists >=2.0.18.
> My reading of PDFBOX-4661 is that there is something funky about these PDFs? 
> PDFBox v2.0.15 had the correct text output. Testing in PDF.js incorrectly 
> produces 공개뮈픸, so I can see there is something non-trivial here.
> Any help is much appreciated.
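
For anyone comparing the attached text dumps, a minimal sketch of how one might locate where two extracted strings diverge, using only the JDK. The class and method names (CodePointDiff, firstDifference, describe) are hypothetical helpers for illustration and are not part of PDFBox; the sample strings are the ones quoted in the issue description.

```java
class CodePointDiff {
    // Index (counted in code points) of the first position where a and b
    // differ, or -1 if both strings are identical. If one string is a
    // prefix of the other, the length of the shorter one is returned.
    static int firstDifference(String a, String b) {
        int[] ca = a.codePoints().toArray();
        int[] cb = b.codePoints().toArray();
        int n = Math.min(ca.length, cb.length);
        for (int i = 0; i < n; i++) {
            if (ca[i] != cb[i]) {
                return i;
            }
        }
        return ca.length == cb.length ? -1 : n;
    }

    // Renders a string as a space-separated list of U+XXXX code points,
    // which is easier to compare than glyphs when fonts are missing.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String expected = "공개번호"; // output reported for 2.0.15
        String pdfJs = "공개뮈픸";    // output reported for PDF.js

        System.out.println("expected: " + describe(expected));
        System.out.println("pdf.js:   " + describe(pdfJs));
        System.out.println("first difference at code point index "
                + firstDifference(expected, pdfJs));
    }
}
```

The same comparison could be run line by line over the two attached .txt dumps to narrow down which CIDs are affected; that part is left out here because the dump layout is not described in this thread.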



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
