Dear Andreas,

Thank you. I'll write up a ticket soon. I may not be able to get to it until Monday MST (US), but I will create it and add some sample PDF files to the ticket. I also have a working Adobe Acrobat Pro example (it uses the bfchar operator instead).
Thank you very much!

Ryan

On Sat, Mar 12, 2022 at 3:28 AM Andreas Lehmkuehler <[email protected]> wrote:

> Hi,
>
> On 11.03.22 at 21:49, Ryan Jackson wrote:
> > Dear Apache Devs:
> >
> > I believe that I have identified a bug in the creation of the
> > (begin/end)bfrange operator used when embedding fonts with the
> > PDCIDFontType2Embedder class.
> >
> > The bug exists (as best I can tell) in both the main trunk and in the 2.0
> > branch. The code in question may be found here
> > <https://github.com/ryanjackson-wf/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/ToUnicodeWriter.java#L133-L136>.
> > The portion of the PDF specification (version 1.7) that bears upon this
> > code is Section 5.9, Example 5.16.
> >
> > The existing code attempts to limit the range logic to changes less than or
> > equal to 255 code points, but it fails to account for at least the
> > following situation by allowing this (for example):
> >
> > [srcCode1 srcCode2 dstString]
> > 03FF 0400 0036
> >
> > The overflow between srcCode1 and srcCode2 (the high byte changes from 03
> > to 04) is not allowed by the specification, and any text extraction will
> > fail. The glyphs themselves render fine, so it is not immediately obvious
> > that there is a problem until one tries to examine the text by using the
> > Content Panel or by copy/pasting from Acrobat (Pro) to some other document.
> > By contrast, the following bfrange operator does allow the text extraction
> > to work as intended:
> >
> > [srcCode1 srcCode2 dstString]
> > 03FE 03FF 0035
> >
> > Notice that no overflow exists, and as such the requirements of the
> > specification are met.
>
> I'm afraid you are right, good catch.
>
> > I've looked briefly at the PDFBOX project in Jira and have found the
> > following tickets that may be caused by this same problem:
> >
> > PDFBOX-4785 <https://issues.apache.org/jira/browse/PDFBOX-4785>
> > PDFBOX-5350 <https://issues.apache.org/jira/browse/PDFBOX-5350>
>
> Yes, somehow. Those are about reading malformed PDFs containing the very
> same issue you have described above.
> Fun fact: we are complaining about other PDF writers not following the spec
> and are doing the very same. I never came up with the idea to check our own
> code :-(
>
> > I have put together a proposed solution here
> > <https://github.com/ryanjackson-wf/pdfbox/pull/1> in my fork of the PDFBox
> > GH mirror. With your permission I'd like to open a new Jira ticket for this
> > and collaborate with whoever would like to help drive this work to get it
> > reviewed and merged. I do have some open questions about how surrogates are
> > to be handled. I'm also open to changes in the proposed code.
>
> You don't have to wait for permission. Please create a JIRA ticket including
> a link to your PR.
>
> > Thank you for your time.
>
> Thanks for your time and the proposed solution.
>
> Andreas
>
> > Sincerely,
> >
> > Ryan Jackson
> > Senior Software Engineer
> > Workiva Inc.
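For illustration only, the source-code constraint described above (srcCode1 and srcCode2 may differ only in their last byte) can be sketched as follows. This is a minimal sketch and not the code from the linked PR or from ToUnicodeWriter: the class BfRangeSplitter and its methods are hypothetical names, it assumes two-byte source codes, and it deliberately ignores the destination string and the open question about surrogates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: splits a run of consecutive
// two-byte source codes so that no bfrange entry crosses a high-byte boundary.
final class BfRangeSplitter {

    /** True when both codes share the same high byte, i.e. only the last byte differs. */
    static boolean sameHighByte(int srcFrom, int srcTo) {
        return (srcFrom >> 8) == (srcTo >> 8);
    }

    /**
     * Splits the inclusive range [srcFrom, srcTo] into sub-ranges that each
     * stay within one high byte. Each element of the result is {first, last}.
     */
    static List<int[]> split(int srcFrom, int srcTo) {
        if (srcFrom < 0 || srcTo > 0xFFFF || srcFrom > srcTo) {
            throw new IllegalArgumentException("expected 0 <= srcFrom <= srcTo <= 0xFFFF");
        }
        List<int[]> ranges = new ArrayList<>();
        int start = srcFrom;
        while (start <= srcTo) {
            int lastInBlock = start | 0xFF;          // highest code with the same high byte
            int end = Math.min(lastInBlock, srcTo);  // never run past the requested range
            ranges.add(new int[] {start, end});
            start = end + 1;                         // continue in the next 256-code block
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Prints "<03FE> <03FF>" and then "<0400> <0401>".
        for (int[] r : split(0x03FE, 0x0401)) {
            System.out.printf("<%04X> <%04X>%n", r[0], r[1]);
        }
    }
}
```

With this check, the failing entry 03FF 0400 from the example would be written as two separate entries, while 03FE 03FF would pass through unchanged. A real fix would also have to keep the destination strings in step with each split.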
