[ 
https://issues.apache.org/jira/browse/PDFBOX-5961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17929225#comment-17929225
 ] 

Tilman Hausherr commented on PDFBOX-5961:
-----------------------------------------

No improvements (not surprising: in the debug output, the № was the only one 
that had more than 2 bytes). My own text extraction test files show no changes 
either. We'll see whether the "big" regression tests turn up anything new. I'll 
commit to the trunk over the weekend and then wait a bit before committing to 
the other versions.

> IllegalArgumentException: Not a valid Unicode code point: 0xE28496
> ------------------------------------------------------------------
>
>                 Key: PDFBOX-5961
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5961
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 4.0.0
>            Reporter: Tilman Hausherr
>            Priority: Major
>         Attachments: PDFJS-19527.pdf
>
>
> {noformat}
> IllegalArgumentException: Not a valid Unicode code point: 0xE28496
>     java.base/java.lang.String.valueOfCodePoint(String.java:3345)
>     java.base/java.lang.Character.toString(Character.java:8053)
>     org.apache.pdfbox.pdmodel.font.PDType0Font.toUnicode(PDType0Font.java:548)
>     org.apache.pdfbox.pdmodel.font.PDFont.toUnicode(PDFont.java:450)
>     org.apache.pdfbox.text.LegacyPDFStreamEngine.showGlyph(LegacyPDFStreamEngine.java:279)
>     org.apache.pdfbox.debugger.pagepane.DebugTextOverlay$DebugTextStripper.showGlyph(DebugTextOverlay.java:209)
>     org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:792)
>     org.apache.pdfbox.contentstream.PDFStreamEngine.showTextString(PDFStreamEngine.java:651)
> {noformat}
> The problem is somehow related to the /ToUnicode stream at 
> {{Root/Pages/Kids/[0]/Resources/Font/F3/ToUnicode}}. This is a different bug 
> from PDFBOX-5960 and not the problem reported in PDF.js 19527. I experimented 
> a bit with supporting 3-byte codes (memo for me: version before 21.2 12:20), 
> but the exception is still the same.
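
For readers following along: the rejected value 0xE28496 happens to match the
UTF-8 byte sequence E2 84 96, which encodes U+2116 (№). The sketch below only
illustrates that arithmetic and why Character.toString(int) refuses the packed
value. It is not PDFBox code; the class and helper names are hypothetical, and
whether the /ToUnicode stream really stores the character this way is an
assumption that the issue does not confirm.

```java
import java.nio.charset.StandardCharsets;

public class CodePointDemo {

    // Hypothetical helper (not part of PDFBox): packs raw bytes
    // big-endian into a single int, e.g. E2 84 96 -> 0xE28496.
    static int packBigEndian(byte[] bytes) {
        int value = 0;
        for (byte b : bytes) {
            value = (value << 8) | (b & 0xFF);
        }
        return value;
    }

    public static void main(String[] args) {
        // Assumed input: the three bytes seen in the exception message.
        byte[] raw = {(byte) 0xE2, (byte) 0x84, (byte) 0x96};

        // Treating the bytes as one number gives a value above 0x10FFFF,
        // which is outside the Unicode code point range.
        int packed = packBigEndian(raw);
        System.out.printf("packed = 0x%X, valid code point: %b%n",
                packed, Character.isValidCodePoint(packed));

        try {
            Character.toString(packed); // Java 11+; throws for invalid code points
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        // Decoding the same bytes as UTF-8 yields a single character instead.
        String decoded = new String(raw, StandardCharsets.UTF_8);
        System.out.printf("decoded as UTF-8 = U+%04X%n", decoded.codePointAt(0));
    }
}
```

The point of the comparison is that the same three bytes are either an
out-of-range number or one valid character, depending on how they are
interpreted.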



