Re: Suspected bug in and proposed fix for ToUnicodeWriter.writeTo

Ryan Jackson Sat, 12 Mar 2022 06:40:03 -0800

Dear Andreas,

Thank you. I'll write up a ticket soon. I may not be able to get to it
until Monday MST (US) but will create it and add some sample PDF files to
the ticket. I also have a working Adobe Acrobat Pro example (they are using
the bfchar operator instead).


Many thanks!

Ryan.


On Sat, Mar 12, 2022 at 3:28 AM Andreas Lehmkuehler <[email protected]>
wrote:

> Hi,
>
> On 11.03.22 at 21:49, Ryan Jackson wrote:
> > Dear Apache Devs:
> >
> > I believe that I have identified a bug in the creation of the
> > (begin/end)bfrange operator used when embedding fonts with the
> > PDCIDFontType2Embedder class.
> >
> > The bug exists (as best I can tell) in both the main trunk and in the 2.0
> > branch. The code in question may be found here
> > <https://github.com/ryanjackson-wf/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/ToUnicodeWriter.java#L133-L136>.
> > The portion of the PDF specification (version 1.7) that bears upon this
> > code is Section 5.9, Example 5.16.
> >
> > The existing code attempts to limit the range logic to changes less than or
> > equal to 255 code points, but it fails to account for at least the
> > following situation by allowing this (for example):
> >
> > [srcCode1 srcCode2 dstString]
> > 03FF 0400 0036
> >
> > The overflow between srcCode1 and srcCode2 is not allowed by the
> > specification and any text extraction will fail. The glyphs themselves
> > render fine so it is not immediately obvious there is a problem until one
> > tries to examine the text by using the Content Panel or by copy/pasting
> > from Acrobat (Pro) to some other document. By contrast the following
> > bfrange operator does allow the text extraction to work as intended:
> >
> > [srcCode1 srcCode2 dstString]
> > 03FE 03FF 0035
> >
> > Notice that no overflow exists, and as such the requirements of the
> > specification are met.
> I'm afraid you are right, good catch.
>
> > I've looked briefly at the PDFBOX project in Jira and have found the
> > following tickets that may be caused by this same problem:
> >
> > PDFBOX-4785 <https://issues.apache.org/jira/browse/PDFBOX-4785>
> > PDFBOX-5350 <https://issues.apache.org/jira/browse/PDFBOX-5350>
> Yes, somehow. Those are about reading malformed PDFs containing the very
> same issue you have described above.
> Fun fact: we complain about other PDF writers not following the spec and
> do the very same ourselves. It never occurred to me to check our own
> code :-(
>
> > I have put together a proposed solution here
> > <https://github.com/ryanjackson-wf/pdfbox/pull/1> in my fork of the PDFBox
> > GH mirror. With your permission I'd like to open a new Jira ticket for this
> > and collaborate with whoever would like to help drive this work to get it
> > reviewed and merged. I do have some open questions about how surrogates are
> > to be handled. I'm also open to changes in the proposed code.
> You don't have to wait for permission. Please create a JIRA ticket
> including a link to your PR.
>
>
> > Thank you for your time.
> Thanks for your time and the proposed solution.
>
> Andreas
>
> >
> > Sincerely,
> >
> > Ryan Jackson
> > Senior Software Engineer
> > Workiva Inc.
> >
>
>
>
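
The source-code constraint described in the quoted report can be illustrated
with a minimal Java sketch. It assumes 2-byte source codes and reads the
"overflow" as the low byte wrapping, so that srcCode1 and srcCode2 end up with
different high bytes. The class and method names are hypothetical; this is not
the code from the linked pull request or from ToUnicodeWriter.

```java
// Illustrative only: BfRangeCheck and isValidSourceRange are hypothetical names.
final class BfRangeCheck {
    private BfRangeCheck() {
    }

    /**
     * Returns true if srcCode1..srcCode2 is an ordered range of 2-byte codes
     * whose high byte does not change, i.e. only the low byte is incremented.
     * Assumption: this is the condition the thread calls "no overflow".
     */
    public static boolean isValidSourceRange(int srcCode1, int srcCode2) {
        if (srcCode1 < 0 || srcCode2 > 0xFFFF || srcCode1 > srcCode2) {
            return false;
        }
        // Compare the high bytes; a difference means the low byte wrapped.
        return (srcCode1 >> 8) == (srcCode2 >> 8);
    }

    public static void main(String[] args) {
        // The failing example from the report: 03FF -> 0400 changes the high byte.
        System.out.println(isValidSourceRange(0x03FF, 0x0400)); // false
        // The working example from the report: 03FE -> 03FF keeps high byte 03.
        System.out.println(isValidSourceRange(0x03FE, 0x03FF)); // true
    }
}
```

Under these assumptions, a writer would close the current range and start a new
one whenever the check returns false for the next code.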
