[ 
https://issues.apache.org/jira/browse/PDFBOX-5387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508816#comment-17508816
 ] 

Ryan Jackson commented on PDFBOX-5387:
--------------------------------------

[~lehmi] 

Thank you for asking. I meant to add a couple of other notes to this ticket but 
failed to do so. Here are the two outstanding things on my mind:
# Technically, we could use the {{bfchar}} operator instead of {{bfrange}} when 
the logic allows only one element in what would otherwise be a range. That 
would be a small optimization. In my investigation of this issue, Adobe Acrobat 
(Pro) used {{bfchar}} for my small sample document (just a few numbers, so not 
representative of a "real" document). It apparently favors {{bfchar}} for small 
subsets.
# With regard to UTF-16 supplementary characters, I am *not* certain that the 
existing code (copied below; see {{allowDestinationRange}}) actually does what 
we want:

{code:java}
// Allow the new destination string if:
// 1. It is sequential with the previous one and differs only in the low-order byte
// 2. The previous string does not contain any UTF-16 surrogates
return allowCodeRange(prevCode, nextCode)
        && prev.codePointCount(0, prev.length()) == 1;
{code}

Notice that the call to {{String.codePointCount}} is essentially unchanged from 
before. My question, however (and what I was trying to get at through the unit 
tests and that "TODO" comment), is whether it actually tells you that the string 
contains surrogate pairs. If you ask for the number of Unicode code points 
(essentially UTF-32) over the entire length of the string, I'd expect the count 
to always be one, unless the string somehow holds more than one code point (a 
Go {{rune}}). A surrogate pair still counts as a single code point, so the check 
would not reject it. I don't know the overall code well enough to say how it is 
called in these cases.
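
To make the concern concrete, here is a minimal standalone sketch (not PDFBox code; the class and method names are hypothetical) that compares a {{codePointCount(...) == 1}} test with an explicit scan for surrogate code units. It uses only {{String.codePointCount}} and {{Character.isSurrogate}} from the JDK.

```java
final class CodePointCountSketch {

    // Hypothetical helper: true when the string holds exactly one Unicode code point.
    static boolean isSingleCodePoint(String s) {
        return s.codePointCount(0, s.length()) == 1;
    }

    // Hypothetical helper: true when any UTF-16 code unit in the string is a surrogate.
    static boolean containsSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String bmp = "A";                       // U+0041, one UTF-16 code unit
        String supplementary = "\uD83D\uDE00";  // U+1F600, stored as a surrogate pair
        String threeChars = "ffi";              // three BMP code points

        for (String s : new String[] {bmp, supplementary, threeChars}) {
            System.out.println("length=" + s.length()
                    + " singleCodePoint=" + isSingleCodePoint(s)
                    + " containsSurrogate=" + containsSurrogate(s));
        }
    }
}
```

In this sketch the supplementary character passes the single-code-point test even though it is stored as a surrogate pair, while the three-character string fails it. If the intent is to exclude surrogates, a scan like {{containsSurrogate}} would be one way to express that directly.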

The last question I have is whether I violated my own assumption in 
{{allowCodeRange}} (which assumes 16-bit values) by calling it from 
{{allowDestinationRange}} with values that may exceed 16 bits (the code point 
is effectively a UTF-32 value).
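
As an illustration of the width question, here is a minimal sketch of a "sequential, low-order byte only" check that makes no 16-bit assumption. It is a hypothetical standalone method, not the actual {{allowCodeRange}} implementation: it compares every bit above the low byte, so it behaves the same for 16-bit codes and for code points above U+FFFF.

```java
final class LowByteRangeSketch {

    // Hypothetical check: next must directly follow prev, and the increment
    // must not carry out of the low-order byte.
    static boolean sequentialInLowByte(int prev, int next) {
        if (prev + 1 != next) {
            return false;
        }
        // Compare all bits above the low byte, not only bits 8-15,
        // so values wider than 16 bits are handled as well.
        return (prev >>> 8) == (next >>> 8);
    }

    public static void main(String[] args) {
        System.out.println(sequentialInLowByte(0x03FE, 0x03FF));   // no carry
        System.out.println(sequentialInLowByte(0x03FF, 0x0400));   // carry out of the low byte
        System.out.println(sequentialInLowByte(0x1F600, 0x1F601)); // wider than 16 bits, no carry
        System.out.println(sequentialInLowByte(0x1FFFF, 0x20000)); // wider than 16 bits, carry
    }
}
```

Because the two values must be sequential, the only way any higher bit can change is a carry out of the low byte. Whether the real method needs a change therefore depends on how it masks its inputs, which this sketch does not assume.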

I apologize that I did not mention these questions earlier, but I am glad that 
we can discuss them and make any changes if necessary.
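
For the first note above, the {{bfchar}} idea can be sketched as follows. This is a hypothetical formatter, not the {{ToUnicodeWriter}} API: it only shows how a one-element run collapses to a {{bfchar}} entry ({{<srcCode> <dstString>}}) while a longer run stays a {{bfrange}} entry ({{<srcCode1> <srcCode2> <dstString>}}). The destination is passed as a hex string, and source codes are assumed to be two bytes wide.

```java
final class BfOperatorSketch {

    // Hypothetical helper: names the operator that fits a run of source codes.
    static String operatorFor(int firstCode, int lastCode) {
        return firstCode == lastCode ? "bfchar" : "bfrange";
    }

    // Hypothetical helper: formats one CMap entry for the run.
    static String formatEntry(int firstCode, int lastCode, String dstHex) {
        if (firstCode == lastCode) {
            // Single mapping: source code and destination string only.
            return String.format("<%04X> <%s>", firstCode, dstHex);
        }
        // Range mapping: first code, last code, destination for the first code.
        return String.format("<%04X> <%04X> <%s>", firstCode, lastCode, dstHex);
    }

    public static void main(String[] args) {
        System.out.println(operatorFor(0x03FF, 0x03FF) + ": "
                + formatEntry(0x03FF, 0x03FF, "0036"));
        System.out.println(operatorFor(0x03FE, 0x03FF) + ": "
                + formatEntry(0x03FE, 0x03FF, "0035"));
    }
}
```

Note that {{bfchar}} and {{bfrange}} entries live in separate {{begin...}}/{{end...}} sections of the CMap, so a writer that adopts this would need to collect the two kinds of entries separately before emitting them.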

> ToUnicodeWriter.writeTo allows byte overflow in bfrange operator
> ----------------------------------------------------------------
>
>                 Key: PDFBOX-5387
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5387
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 2.0.25
>            Reporter: Ryan Jackson
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>             Fix For: 2.0.26, 3.0.0 PDFBox
>
>
> The {{writeTo}} method of {{ToUnicodeWriter}} allows overflow in the 
> low-order byte when writing the {{(begin/end)bfrange}} operator.
> As far as I can tell it is used only with the {{PDCIDFontType2Embedder}} 
> class. I believe the bug exists in both the main trunk and in the 2.x branch. 
> The code in question may be found 
> [here|https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/ToUnicodeWriter.java#L133-L136]
>  .
> The portion of the PDF specification (version 1.7) that bears upon this code 
> is Section 5.9, Example 5.16.
> The existing code attempts to limit the range logic to changes less than or 
> equal to 255 code points, but it fails to account for at least the following 
> situation by allowing this (for example):
> [srcCode1 srcCode2 dstString]
> 03FF 0400 0036
> The overflow between srcCode1 and srcCode2 is not allowed by the 
> specification and any text extraction will fail. The glyphs themselves render 
> fine so it is not immediately obvious there is a problem until one tries to 
> examine the text by using the Content Panel or by copy/pasting from Acrobat 
> (Pro) to some other document. By contrast the following bfrange operator does 
> allow the text extraction to work as intended:
> [srcCode1 srcCode2 dstString]
> 03FE 03FF 0035
> Notice that no overflow exists, and as such the requirements of the 
> specification are met.
> I have put together a proposed solution 
> [here|https://github.com/ryanjackson-wf/pdfbox/pull/1] in my fork of the 
> PDFBox GH mirror.
>  


