[
https://issues.apache.org/jira/browse/PDFBOX-5387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508816#comment-17508816
]
Ryan Jackson commented on PDFBOX-5387:
--------------------------------------
[~lehmi]
Thank you for asking. I meant to add a couple of other notes to this ticket but
failed to do so. Here are the outstanding things on my mind:
# Technically we might consider using the {{bfchar}} operator instead of
{{bfrange}} when the logic allows only one element (in what would otherwise be
a one-element range). That would be a small optimization. While investigating
this issue, I found that Adobe Acrobat (Pro) uses {{bfchar}} for my small
sample document (just a few numbers, so not representative of a "real"
document). Apparently it favors {{bfchar}} for small subsets.
# With regard to UTF-16 supplementary characters, I am *not* certain that the
existing code (copied below; see {{allowDestinationRange}}) actually does what
we want:
{code:java}
// Allow the new destination string if:
// 1. It is sequential with the previous one and differs only in the low-order byte
// 2. The previous string does not contain any UTF-16 surrogates
return allowCodeRange(prevCode, nextCode)
        && prev.codePointCount(0, prev.length()) == 1;
{code}
Notice that the call to {{String.codePointCount}} is essentially unchanged from
before. My question, however (and what I was trying to get at through the unit
tests and that "TODO" comment), is whether that call actually tells you
anything about surrogate pairs. If you ask for the number of Unicode code
points (essentially UTF-32) over the entire length of the string, I'd expect
the count to always be "one", unless the string somehow represents more than
one code point (a golang {{rune}}). A surrogate pair counts as a single code
point, so a count of one does not rule out surrogates. I don't know the overall
code well enough to say how it is called in these cases.
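To make the concern concrete, here is a minimal standalone sketch of how {{String.codePointCount}} treats a surrogate pair. The class {{CodePointCountSketch}} and its helper methods are hypothetical names for illustration only and are not part of PDFBox; {{passesExistingCheck}} mirrors only the {{codePointCount}} half of the condition quoted above.

```java
// Hypothetical illustration; none of these names exist in PDFBox.
final class CodePointCountSketch {

    // Mirrors the second half of the quoted condition:
    // true when the string holds exactly one Unicode code point.
    static boolean passesExistingCheck(String s) {
        return s.codePointCount(0, s.length()) == 1;
    }

    // Explicit test for UTF-16 surrogate code units.
    static boolean containsSurrogates(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c) || Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String bmp = "A";                      // one char, one code point
        String supplementary = "\uD83D\uDE00"; // U+1F600: two chars, one code point
        String ligatureTarget = "ff";          // two chars, two code points

        for (String s : new String[] { bmp, supplementary, ligatureTarget }) {
            System.out.println(s.length() + " chars, "
                    + s.codePointCount(0, s.length()) + " code points, "
                    + "passesExistingCheck=" + passesExistingCheck(s) + ", "
                    + "containsSurrogates=" + containsSurrogates(s));
        }
    }
}
```

In this sketch the supplementary character passes the {{codePointCount}} test even though it consists of two surrogate code units, while a two-character destination such as "ff" fails it. If the intent is really "no surrogates", an explicit check along the lines of {{containsSurrogates}} may be closer to it, but that depends on how the method is called.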
The last question I have is whether I may have violated an assumption of my
own in {{allowCodeRange}} (which assumes 16-bit values) by calling it from
{{allowDestinationRange}} with values potentially wider than 16 bits (a code
point is effectively a UTF-32 value and can exceed 0xFFFF).
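As a rough sketch of what a width-independent check could look like, the following compares everything above the low-order byte instead of assuming a 16-bit value. This is not the PDFBox implementation of {{allowCodeRange}}; the class {{LowByteRangeSketch}} and the method {{isSequentialInLowByte}} are hypothetical, and whether the real method behaves the same way for values above 0xFFFF would need to be checked against the source.

```java
// Hypothetical sketch; not the actual allowCodeRange from PDFBox.
final class LowByteRangeSketch {

    // True when next directly follows prev and only the low-order byte
    // changes, i.e. the increment does not carry into the higher bytes.
    // Intended for non-negative values of any width, including code
    // points above 0xFFFF.
    static boolean isSequentialInLowByte(int prev, int next) {
        boolean sequential = next == prev + 1;
        boolean sameHighBytes = (prev >>> 8) == (next >>> 8);
        return sequential && sameHighBytes;
    }

    public static void main(String[] args) {
        // The two source-code pairs from the issue description.
        System.out.println(isSequentialInLowByte(0x03FE, 0x03FF)); // no overflow
        System.out.println(isSequentialInLowByte(0x03FF, 0x0400)); // low byte overflows

        // A supplementary code point obtained from a surrogate pair.
        int cp = "\uD83D\uDE00".codePointAt(0); // U+1F600
        System.out.println(isSequentialInLowByte(cp, cp + 1));
        System.out.println(isSequentialInLowByte(0x1F6FF, 0x1F700));
    }
}
```

With this formulation the 03FF to 0400 case from the issue description is rejected, and the same rule applies unchanged to values wider than 16 bits.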
I apologize that I did not mention these questions earlier, but I am glad that
we can discuss them and make any changes if necessary.
> ToUnicodeWriter.writeTo allows byte overflow in bfrange operator
> ----------------------------------------------------------------
>
> Key: PDFBOX-5387
> URL: https://issues.apache.org/jira/browse/PDFBOX-5387
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel
> Affects Versions: 2.0.25
> Reporter: Ryan Jackson
> Assignee: Andreas Lehmkühler
> Priority: Major
> Fix For: 2.0.26, 3.0.0 PDFBox
>
>
> The {{writeTo}} method of {{ToUnicodeWriter}} allows overflow in the
> low-order byte when writing the {{(begin/end)bfrange}} operator.
> As far as I can tell it is used only with the {{PDCIDFontType2Embedder}}
> class. I believe the bug exists in both the main trunk and in the 2.x branch.
> The code in question may be found
> [here|https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/ToUnicodeWriter.java#L133-L136].
> The portion of the PDF specification (version 1.7) that bears upon this code
> is Section 5.9, Example 5.16.
> The existing code attempts to limit the range logic to changes less than or
> equal to 255 code points, but it fails to account for at least the following
> situation by allowing this (for example):
> [srcCode1 srcCode2 dstString]
> 03FF 0400 0036
> The overflow between srcCode1 and srcCode2 is not allowed by the
> specification and any text extraction will fail. The glyphs themselves render
> fine so it is not immediately obvious there is a problem until one tries to
> examine the text by using the Content Panel or by copy/pasting from Acrobat
> (Pro) to some other document. By contrast the following bfrange operator does
> allow the text extraction to work as intended:
> [srcCode1 srcCode2 dstString]
> 03FE 03FF 0035
> Notice that no overflow exists, and as such the requirements of the
> specification are met.
> I have put together a proposed solution
> [here|https://github.com/ryanjackson-wf/pdfbox/pull/1] in my fork of the
> PDFBox GH mirror.
>