[jira] [Updated] (PDFBOX-5350) Regression unicode mapping in Korean document

2023-10-08 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5350:

Component/s: Text extraction
 (was: FontBox)

> Regression unicode mapping in Korean document
> ---------------------------------------------
>
> Key: PDFBOX-5350
> URL: https://issues.apache.org/jira/browse/PDFBOX-5350
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.16, 2.0.18, 2.0.20, 2.0.25, 3.0.0 PDFBox
> Reporter: John Mayfield
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: regression
> Fix For: 2.0.30, 3.0.1 PDFBox, 4.0.0
>
> Attachments: 
> 7VL5DM7XOQC6AP4H6KL3XC7CYQ547FFQ.pdf_cmap_non-strict.txt, 
> 7VL5DM7XOQC6AP4H6KL3XC7CYQ547FFQ.pdf_cmap_strict.txt, 
> FQJR7LPGFZDCDX7A3FEBVQRHNHZ7XDQ2.pdf_cmap_non-strict.txt, 
> FQJR7LPGFZDCDX7A3FEBVQRHNHZ7XDQ2.pdf_cmap_strict.txt, 
> JX57O5E5YG6XM4FZABPULQGTW4OXPCWA.pdf_cmap_non-strict.txt, 
> JX57O5E5YG6XM4FZABPULQGTW4OXPCWA.pdf_cmap_strict.txt, KR1019900015076.pdf, 
> KR101998128.pdf, KR101998128_2_0_15.txt, KR101998128_2_0_25.txt, 
> KR1020140140600.pdf, 
> PDFBOX-5350-JX57O5E5YG6XM4FZABPULQGTW4OXPCWA-p1-reduced.pdf, 
> reports_pdfbox_2.0.29_vs_3.0.0.tar.xz
>
>
> The text output from Korean patent PDFs changed in releases after v2.0.15 
> (possibly due to Unicode mapping?). This was previously addressed in 
> PDFBOX-4661, which fixed that example PDF in v2.0.18 - thanks. Unfortunately, 
> v2.0.18 causes other documents (included here) to now produce incorrect text 
> output.
>
> PDFTextStripper stripper = new PDFTextStripper();
> try (PDDocument doc = PDDocument.load(new File("KR101998128.pdf"))) {
>     String text = stripper.getText(doc);
> }
>
> As in PDFBOX-4661, there are numerous warnings of the form:
>
> WARNING: No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe
>
> I've attached the text dumps from two versions, but in brief:
>
> 2.0.15: 공개번호 (public number)
> 2.0.25: 공개 (the final two characters, 번호, are missing)
>
> I only confirmed the issue in the versions listed above, but presume it 
> persists in all versions >= 2.0.18.
>
> My reading of PDFBOX-4661 is that there is something unusual about these 
> PDFs. PDFBox v2.0.15 produced the correct text output. PDF.js incorrectly 
> produces 공개뮈픸, so there is clearly something non-trivial going on here.
>
> Any help is much appreciated.
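
When triaging documents like these, it can help to see which fonts account for the "No Unicode mapping" warnings. The following is a minimal sketch in plain Java that tallies such warnings per font from captured log lines. It assumes only the message format quoted in the report above; the class name MissingMappingTally, the method countByFont, and the second sample CID (14173) are hypothetical and are not part of PDFBox.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a PDFBox class): counts "No Unicode mapping"
// warnings per font, based only on the message format quoted in this issue.
final class MissingMappingTally {

    // Matches e.g. "No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe".
    // Group 1: CID after "CID+", group 2: number in parentheses, group 3: font name.
    private static final Pattern WARNING = Pattern.compile(
            "No Unicode mapping for CID\\+(\\d+) \\((\\d+)\\) in font (\\S+)");

    // Returns font name -> number of matching warning lines, in first-seen order.
    static Map<String, Integer> countByFont(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = WARNING.matcher(line);
            if (m.find()) {
                counts.merge(m.group(3), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The first line is taken from the report; the others are illustrative.
        List<String> sample = Arrays.asList(
                "WARNING: No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe",
                "WARNING: No Unicode mapping for CID+14173 (14173) in font GKYPPJ+GulimChe",
                "INFO: unrelated line");
        countByFont(sample).forEach((font, n) -> System.out.println(font + ": " + n));
    }
}
```

A tally like this only summarizes the log output; it does not say whether the missing mappings come from the PDF's own CMap data or from the library's lookup.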



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Updated] (PDFBOX-5350) Regression unicode mapping in Korean document

2023-10-08 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5350:

Attachment: PDFBOX-5350-JX57O5E5YG6XM4FZABPULQGTW4OXPCWA-p1-reduced.pdf

> Regression unicode mapping in Korean document
> ---------------------------------------------
>
> Key: PDFBOX-5350
> URL: https://issues.apache.org/jira/browse/PDFBOX-5350
> Project: PDFBox
> Issue Type: Bug
> Components: FontBox
> Affects Versions: 2.0.16, 2.0.18, 2.0.20, 2.0.25, 3.0.0 PDFBox
> Reporter: John Mayfield
> Priority: Major
> Labels: regression
> Attachments: 
> 7VL5DM7XOQC6AP4H6KL3XC7CYQ547FFQ.pdf_cmap_non-strict.txt, 
> 7VL5DM7XOQC6AP4H6KL3XC7CYQ547FFQ.pdf_cmap_strict.txt, 
> FQJR7LPGFZDCDX7A3FEBVQRHNHZ7XDQ2.pdf_cmap_non-strict.txt, 
> FQJR7LPGFZDCDX7A3FEBVQRHNHZ7XDQ2.pdf_cmap_strict.txt, 
> JX57O5E5YG6XM4FZABPULQGTW4OXPCWA.pdf_cmap_non-strict.txt, 
> JX57O5E5YG6XM4FZABPULQGTW4OXPCWA.pdf_cmap_strict.txt, KR1019900015076.pdf, 
> KR101998128.pdf, KR101998128_2_0_15.txt, KR101998128_2_0_25.txt, 
> KR1020140140600.pdf, 
> PDFBOX-5350-JX57O5E5YG6XM4FZABPULQGTW4OXPCWA-p1-reduced.pdf, 
> reports_pdfbox_2.0.29_vs_3.0.0.tar.xz



[jira] [Updated] (PDFBOX-5350) Regression unicode mapping in Korean document

2023-10-04 Thread John Mayfield (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Mayfield updated PDFBOX-5350:
--
Attachment: JX57O5E5YG6XM4FZABPULQGTW4OXPCWA.pdf_cmap_strict.txt
7VL5DM7XOQC6AP4H6KL3XC7CYQ547FFQ.pdf_cmap_strict.txt
FQJR7LPGFZDCDX7A3FEBVQRHNHZ7XDQ2.pdf_cmap_strict.txt
JX57O5E5YG6XM4FZABPULQGTW4OXPCWA.pdf_cmap_non-strict.txt
7VL5DM7XOQC6AP4H6KL3XC7CYQ547FFQ.pdf_cmap_non-strict.txt
FQJR7LPGFZDCDX7A3FEBVQRHNHZ7XDQ2.pdf_cmap_non-strict.txt

> Regression unicode mapping in Korean document
> ---------------------------------------------
>
> Key: PDFBOX-5350
> URL: https://issues.apache.org/jira/browse/PDFBOX-5350
> Project: PDFBox
> Issue Type: Bug
> Components: FontBox
> Affects Versions: 2.0.16, 2.0.18, 2.0.20, 2.0.25, 3.0.0 PDFBox
> Reporter: John Mayfield
> Priority: Major
> Labels: regression
> Attachments: 
> 7VL5DM7XOQC6AP4H6KL3XC7CYQ547FFQ.pdf_cmap_non-strict.txt, 
> 7VL5DM7XOQC6AP4H6KL3XC7CYQ547FFQ.pdf_cmap_strict.txt, 
> FQJR7LPGFZDCDX7A3FEBVQRHNHZ7XDQ2.pdf_cmap_non-strict.txt, 
> FQJR7LPGFZDCDX7A3FEBVQRHNHZ7XDQ2.pdf_cmap_strict.txt, 
> JX57O5E5YG6XM4FZABPULQGTW4OXPCWA.pdf_cmap_non-strict.txt, 
> JX57O5E5YG6XM4FZABPULQGTW4OXPCWA.pdf_cmap_strict.txt, KR1019900015076.pdf, 
> KR101998128.pdf, KR101998128_2_0_15.txt, KR101998128_2_0_25.txt, 
> KR1020140140600.pdf, reports_pdfbox_2.0.29_vs_3.0.0.tar.xz



[jira] [Updated] (PDFBOX-5350) Regression unicode mapping in Korean document

2023-10-03 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5350:

Attachment: reports_pdfbox_2.0.29_vs_3.0.0.tar.xz

> Regression unicode mapping in Korean document
> ---------------------------------------------
>
> Key: PDFBOX-5350
> URL: https://issues.apache.org/jira/browse/PDFBOX-5350
> Project: PDFBox
> Issue Type: Bug
> Components: FontBox
> Affects Versions: 2.0.16, 2.0.18, 2.0.20, 2.0.25, 3.0.0 PDFBox
> Reporter: John Mayfield
> Priority: Major
> Labels: regression
> Attachments: KR1019900015076.pdf, KR101998128.pdf, 
> KR101998128_2_0_15.txt, KR101998128_2_0_25.txt, KR1020140140600.pdf, 
> reports_pdfbox_2.0.29_vs_3.0.0.tar.xz



[jira] [Updated] (PDFBOX-5350) Regression unicode mapping in Korean document

2023-09-26 Thread John Mayfield (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Mayfield updated PDFBOX-5350:
--
Affects Version/s: 3.0.0 PDFBox

> Regression unicode mapping in Korean document
> ---------------------------------------------
>
> Key: PDFBOX-5350
> URL: https://issues.apache.org/jira/browse/PDFBOX-5350
> Project: PDFBox
> Issue Type: Bug
> Components: FontBox
> Affects Versions: 2.0.16, 2.0.18, 2.0.20, 2.0.25, 3.0.0 PDFBox
> Reporter: John Mayfield
> Priority: Major
> Labels: regression
> Attachments: KR1019900015076.pdf, KR101998128.pdf, 
> KR101998128_2_0_15.txt, KR101998128_2_0_25.txt, KR1020140140600.pdf



[jira] [Updated] (PDFBOX-5350) Regression unicode mapping in Korean document

2021-12-22 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5350:

Labels: regression  (was: )

> Regression unicode mapping in Korean document
> ---------------------------------------------
>
> Key: PDFBOX-5350
> URL: https://issues.apache.org/jira/browse/PDFBOX-5350
> Project: PDFBox
> Issue Type: Bug
> Components: FontBox
> Affects Versions: 2.0.16, 2.0.18, 2.0.20, 2.0.25
> Reporter: John Mayfield
> Priority: Major
> Labels: regression
> Attachments: KR1019900015076.pdf, KR101998128.pdf, 
> KR101998128_2_0_15.txt, KR101998128_2_0_25.txt, KR1020140140600.pdf