[jira] [Updated] (PDFBOX-5350) Regression unicode mapping in Korean document
[ https://issues.apache.org/jira/browse/PDFBOX-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5350:
------------------------------------
    Component/s: Text extraction
                     (was: FontBox)

> Regression unicode mapping in Korean document
> ---------------------------------------------
>
>                 Key: PDFBOX-5350
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5350
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.16, 2.0.18, 2.0.20, 2.0.25, 3.0.0 PDFBox
>            Reporter: John Mayfield
>            Assignee: Tilman Hausherr
>            Priority: Major
>              Labels: regression
>             Fix For: 2.0.30, 3.0.1 PDFBox, 4.0.0
>
>         Attachments:
>           7VL5DM7XOQC6AP4H6KL3XC7CYQ547FFQ.pdf_cmap_non-strict.txt,
>           7VL5DM7XOQC6AP4H6KL3XC7CYQ547FFQ.pdf_cmap_strict.txt,
>           FQJR7LPGFZDCDX7A3FEBVQRHNHZ7XDQ2.pdf_cmap_non-strict.txt,
>           FQJR7LPGFZDCDX7A3FEBVQRHNHZ7XDQ2.pdf_cmap_strict.txt,
>           JX57O5E5YG6XM4FZABPULQGTW4OXPCWA.pdf_cmap_non-strict.txt,
>           JX57O5E5YG6XM4FZABPULQGTW4OXPCWA.pdf_cmap_strict.txt,
>           KR1019900015076.pdf, KR101998128.pdf,
>           KR101998128_2_0_15.txt, KR101998128_2_0_25.txt,
>           KR1020140140600.pdf,
>           PDFBOX-5350-JX57O5E5YG6XM4FZABPULQGTW4OXPCWA-p1-reduced.pdf,
>           reports_pdfbox_2.0.29_vs_3.0.0.tar.xz
>
>
> The text output from Korean Patent PDFs changed in v2.0.15+ (due to unicode
> mapping?), this was previously addressed in PDFBOX-4661 and resolved that
> example PDF in v2.0.18 - thanks. Unfortunately v2.0.18 causes other documents
> (included here) to now have incorrect text output.
>
>     PDFTextStripper stripper = new PDFTextStripper();
>     PDDocument doc = PDDocument.load(new File("KR101998128.pdf"));
>     stripper.getText(doc);
>
> Like in PDFBOX-4661 there are numerous warnings of the form:
>
>     WARNING: No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe
>
> I've attached the text dump of two versions, but in brief:
>
>     2.0.15: 공개번호 (public number)
>     2.0.25: 공개
>
> I only confirmed the issue in the versions listed above but presume the issue
> persists >=2.0.18.
> My reading of PDFBOX-4661 is there is something funky about these PDFs?
> PDFBOX v2.0.15 had the correct text output. Testing in PDF.js incorrectly
> produces 공개뮈픸 so I can see there is something non-trivial here.
> Any help is much appreciated.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
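For triage, the "No Unicode mapping" warnings quoted in the report can be tallied per font to see which embedded fonts are affected. The following is only a minimal sketch and is not part of PDFBox: the class name CidWarningSummary and the method countByFont are hypothetical, and the pattern assumes the log lines have exactly the form quoted in the report.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a PDFBox class): summarizes warning lines by font.
class CidWarningSummary {
    // Assumed line format, taken from the warning quoted in the report:
    // "WARNING: No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe"
    private static final Pattern WARNING =
            Pattern.compile("No Unicode mapping for CID\\+\\d+ \\(\\d+\\) in font (\\S+)");

    /** Counts matching warnings per font name; lines that do not match are ignored. */
    static Map<String, Integer> countByFont(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = WARNING.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "WARNING: No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe",
                "some unrelated log line");
        // Prints the per-font tally for the sample lines.
        System.out.println(countByFont(sample));
    }
}
```

Running the same tally against the logs from 2.0.15 and 2.0.25 would show whether the set of affected fonts differs between versions, under the assumption that both versions log in this format.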
[jira] [Updated] (PDFBOX-5350) Regression unicode mapping in Korean document
[ https://issues.apache.org/jira/browse/PDFBOX-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5350:
------------------------------------
    Attachment: PDFBOX-5350-JX57O5E5YG6XM4FZABPULQGTW4OXPCWA-p1-reduced.pdf
[jira] [Updated] (PDFBOX-5350) Regression unicode mapping in Korean document
[ https://issues.apache.org/jira/browse/PDFBOX-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Mayfield updated PDFBOX-5350:
----------------------------------
    Attachment: JX57O5E5YG6XM4FZABPULQGTW4OXPCWA.pdf_cmap_strict.txt
                7VL5DM7XOQC6AP4H6KL3XC7CYQ547FFQ.pdf_cmap_strict.txt
                FQJR7LPGFZDCDX7A3FEBVQRHNHZ7XDQ2.pdf_cmap_strict.txt
                JX57O5E5YG6XM4FZABPULQGTW4OXPCWA.pdf_cmap_non-strict.txt
                7VL5DM7XOQC6AP4H6KL3XC7CYQ547FFQ.pdf_cmap_non-strict.txt
                FQJR7LPGFZDCDX7A3FEBVQRHNHZ7XDQ2.pdf_cmap_non-strict.txt
[jira] [Updated] (PDFBOX-5350) Regression unicode mapping in Korean document
[ https://issues.apache.org/jira/browse/PDFBOX-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5350:
------------------------------------
    Attachment: reports_pdfbox_2.0.29_vs_3.0.0.tar.xz
[jira] [Updated] (PDFBOX-5350) Regression unicode mapping in Korean document
[ https://issues.apache.org/jira/browse/PDFBOX-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Mayfield updated PDFBOX-5350:
----------------------------------
    Affects Version/s: 3.0.0 PDFBox
[jira] [Updated] (PDFBOX-5350) Regression unicode mapping in Korean document
[ https://issues.apache.org/jira/browse/PDFBOX-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5350:
------------------------------------
    Labels: regression  (was: )