[ 
https://issues.apache.org/jira/browse/PDFBOX-5961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17929160#comment-17929160
 ] 

Tilman Hausherr edited comment on PDFBOX-5961 at 2/21/25 1:13 PM:
------------------------------------------------------------------

The problematic value is the numero sign (U+2116), but encoded as UTF-8 
(bytes E2 84 96). However, the PDF specification says: "It shall use the 
beginbfchar, endbfchar, beginbfrange, and endbfrange operators to define the 
mapping from character codes to Unicode character sequences expressed in 
UTF-16BE encoding."
Adobe Reader is nevertheless able to extract it.
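
To illustrate the mismatch, here is a minimal sketch using only the JDK. The class and method names are made up for this example and are not part of PDFBox. It reads the three bytes from the exception message as UTF-8 and compares them with the UTF-16BE bytes the specification would expect for the same character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not PDFBox code.
public class ToUnicodeBytesDemo {
    // The three bytes behind the value 0xE28496 from the exception message.
    private static final byte[] BYTES = {(byte) 0xE2, (byte) 0x84, (byte) 0x96};

    // The bytes interpreted as UTF-8, which is apparently what the producer wrote.
    public static String asUtf8() {
        return new String(BYTES, StandardCharsets.UTF_8);
    }

    // The same character encoded as UTF-16BE, as the specification asks for.
    public static byte[] expectedUtf16be() {
        return "\u2116".getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        String s = asUtf8();
        System.out.printf("UTF-8 decoding: U+%04X%n", s.codePointAt(0));
        byte[] b = expectedUtf16be();
        System.out.printf("UTF-16BE bytes: %02X %02X%n", b[0], b[1]);
    }
}
```

Running it prints U+2116 for the UTF-8 decoding and 21 16 for the UTF-16BE bytes, so a conforming /ToUnicode entry would carry two bytes here, not three.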


> IllegalArgumentException: Not a valid Unicode code point: 0xE28496
> ------------------------------------------------------------------
>
>                 Key: PDFBOX-5961
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5961
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 4.0.0
>            Reporter: Tilman Hausherr
>            Priority: Major
>         Attachments: PDFJS-19527.pdf
>
>
> {noformat}
> IllegalArgumentException: Not a valid Unicode code point: 0xE28496
>     java.base/java.lang.String.valueOfCodePoint(String.java:3345)
>     java.base/java.lang.Character.toString(Character.java:8053)
>     org.apache.pdfbox.pdmodel.font.PDType0Font.toUnicode(PDType0Font.java:548)
>     org.apache.pdfbox.pdmodel.font.PDFont.toUnicode(PDFont.java:450)
>     org.apache.pdfbox.text.LegacyPDFStreamEngine.showGlyph(LegacyPDFStreamEngine.java:279)
>     org.apache.pdfbox.debugger.pagepane.DebugTextOverlay$DebugTextStripper.showGlyph(DebugTextOverlay.java:209)
>     org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:792)
>     org.apache.pdfbox.contentstream.PDFStreamEngine.showTextString(PDFStreamEngine.java:651)
> {noformat}
> The problem appears to be related to the /ToUnicode stream at 
> {{Root/Pages/Kids/[0]/Resources/Font/F3/ToUnicode}}. This is a different bug 
> from PDFBOX-5960 and not the problem reported in PDF.js 19527. I experimented 
> with supporting 3-byte codes (memo for me: version before 21.2 12:20), but 
> the exception is still the same.
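
The exception at the top of the stack trace can be reproduced without PDFBox. The sketch below (class and helper names are hypothetical) passes the value from the message to Character.toString(int), the JDK call shown in the trace. It is rejected because 0xE28496 is larger than 0x10FFFF, the highest Unicode code point.

```java
// Hypothetical demo class, not PDFBox code. Requires Java 11 or later
// for Character.toString(int).
public class CodePointCheckDemo {
    // Value taken from the exception message in this issue.
    public static final int CODE = 0xE28496;

    // Returns true if Character.toString(int) rejects the given value.
    public static boolean throwsOnToString(int code) {
        try {
            Character.toString(code);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // The value is outside the valid range 0..0x10FFFF.
        System.out.println("valid code point: " + Character.isValidCodePoint(CODE));
        System.out.println("toString throws:  " + throwsOnToString(CODE));
    }
}
```

A check with Character.isValidCodePoint before the conversion would be one way for a caller to detect such values early; whether PDFBox should do this is not decided in this issue.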


