[ https://issues.apache.org/jira/browse/PDFBOX-5387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509032#comment-17509032 ]
Ryan Jackson commented on PDFBOX-5387:
--------------------------------------
I did a bit of reading on the {{String.codePointCount}} method. The Java 11
[documentation|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#codePointCount(int,int)]
states the following:
"Returns the number of Unicode code points in the specified text range of this
String. The text range begins at the specified beginIndex and extends to the
char at index endIndex - 1. Thus the length (in chars) of the text range is
endIndex-beginIndex. Unpaired surrogates within the text range count as one
code point each."
So the existing test does guard against unpaired surrogates, which seems
appropriate. I would have to study how surrogate pairs are formed in UTF-16
before I could say whether the regular PDF algorithm ("increment the low byte
by one") works safely here. I know little about how the PDF 1.5 specification
applies to surrogate pairs.
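As a quick check of the quoted behaviour, here is a minimal sketch that prints the code point count next to the {{char}} length for a few strings. The class and method names are my own for illustration and are not taken from PDFBox. Note that a string holding only unpaired surrogates still has a count equal to its length, so comparing the two values detects valid pairs rather than unpaired surrogates.

```java
// Illustrative only: shows how String.codePointCount treats surrogates.
final class CodePointCountDemo {

    /** Number of Unicode code points in the whole string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** True if any char in the string is a surrogate, paired or not. */
    static boolean containsSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] samples = {
            "AB",            // two BMP characters
            "\uD83D\uDE00",  // one supplementary character as a valid pair
            "\uD83D",        // unpaired high surrogate
            "\uDE00A"        // unpaired low surrogate followed by 'A'
        };
        for (String s : samples) {
            System.out.println("chars=" + s.length()
                    + " codePoints=" + codePoints(s)
                    + " containsSurrogate=" + containsSurrogate(s));
        }
    }
}
```

For the four samples the code point counts are 2, 1, 1 and 2, against {{char}} lengths of 2, 2, 1 and 2.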
This Microsoft
[article|https://docs.microsoft.com/en-us/windows/win32/intl/surrogates-and-supplementary-characters]
on UTF-16 may also be helpful, but I'm sure you are already familiar with the
topic.
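To make the surrogate question concrete, here is a rough sketch of how a supplementary code point is split into a UTF-16 pair, plus a check of whether two consecutive code points differ only in the last byte of their UTF-16BE form. Both method names are hypothetical and the check is only my reading of the "increment low byte" rule, not PDFBox code.

```java
// Illustrative only: UTF-16 surrogate pair arithmetic.
final class SurrogatePairDemo {

    /** Encodes a supplementary code point (U+10000..U+10FFFF) as {high, low}. */
    static char[] encode(int codePoint) {
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    /**
     * True if next == prev + 1 and the two UTF-16BE encodings differ only
     * in their last byte, so a low-byte increment turns prev into next.
     */
    static boolean differsOnlyInLastByte(int prev, int next) {
        if (next != prev + 1) {
            return false;
        }
        char[] a = Character.toChars(prev);
        char[] b = Character.toChars(next);
        if (a.length != b.length) {
            return false;                               // BMP to supplementary boundary
        }
        int last = a.length - 1;
        for (int i = 0; i < last; i++) {
            if (a[i] != b[i]) {
                return false;                           // high surrogate changed
            }
        }
        return (a[last] >> 8) == (b[last] >> 8);        // same high-order byte
    }

    public static void main(String[] args) {
        char[] pair = encode(0x1F600);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
        System.out.println(differsOnlyInLastByte(0x1F600, 0x1F601));
        System.out.println(differsOnlyInLastByte(0x100FF, 0x10100));
        System.out.println(differsOnlyInLastByte(0x103FF, 0x10400));
    }
}
```

U+1F600 encodes as D83D DE00. The step from U+100FF to U+10100 overflows the low byte of the low surrogate (DCFF to DD00), and the step from U+103FF to U+10400 changes the high surrogate (D800 DFFF to D801 DC00), so neither can be reached by a low-byte increment alone.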
> ToUnicodeWriter.writeTo allows byte overflow in bfrange operator
> ----------------------------------------------------------------
>
> Key: PDFBOX-5387
> URL: https://issues.apache.org/jira/browse/PDFBOX-5387
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel
> Affects Versions: 2.0.25
> Reporter: Ryan Jackson
> Assignee: Andreas Lehmkühler
> Priority: Major
> Fix For: 2.0.26, 3.0.0 PDFBox
>
>
> The {{writeTo}} method of {{ToUnicodeWriter}} allows overflow in the
> low-order byte when writing the {{(begin/end)bfrange}} operator.
> As far as I can tell it is used only with the {{PDCIDFontType2Embedder}}
> class. I believe the bug exists in both the main trunk and in the 2.x branch.
> The code in question may be found
> [here|https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/ToUnicodeWriter.java#L133-L136].
> The portion of the PDF specification (version 1.7) that bears upon this code
> is Section 5.9, Example 5.16.
> The existing code attempts to limit each range to a span of at most 255
> code points, but it fails to account for at least the following situation,
> in which it allows (for example):
> [srcCode1 srcCode2 dstString]
> 03FF 0400 0036
> The overflow between srcCode1 and srcCode2 is not allowed by the
> specification and any text extraction will fail. The glyphs themselves render
> fine so it is not immediately obvious there is a problem until one tries to
> examine the text by using the Content Panel or by copy/pasting from Acrobat
> (Pro) to some other document. By contrast the following bfrange operator does
> allow the text extraction to work as intended:
> [srcCode1 srcCode2 dstString]
> 03FE 03FF 0035
> Notice that no overflow exists, and as such the requirements of the
> specification are met.
> I have put together a proposed solution
> [here|https://github.com/ryanjackson-wf/pdfbox/pull/1] in my fork of the
> PDFBox GH mirror.
>
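The range rule described in the quoted issue can be sketched as follows. This is a hypothetical helper, not the code in the linked pull request: it assumes 2-byte source codes and only splits a run of consecutive codes so that no range crosses a low-order byte boundary. It does not look at the destination strings.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: keeps srcCode1 and srcCode2 of each range in the same
// 256-code block, so the low-order byte never overflows within a range.
final class BfRangeSplitDemo {

    /** Splits [first, last] into {lo, hi} ranges that share one high-order byte. */
    static List<int[]> split(int first, int last) {
        List<int[]> ranges = new ArrayList<>();
        int start = first;
        while (start <= last) {
            int endOfBlock = start | 0xFF;          // last code with the same high byte
            int end = Math.min(endOfBlock, last);
            ranges.add(new int[] { start, end });
            start = end + 1;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 03FE..0401 crosses the 03FF/0400 boundary and is split in two.
        for (int[] r : split(0x03FE, 0x0401)) {
            System.out.printf("%04X %04X%n", r[0], r[1]);
        }
    }
}
```

With the input 03FE..0401 this yields the ranges 03FE 03FF and 0400 0401, so the pair 03FF 0400 from the example never ends up in one range.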