Quote glyphs (quoteright, quotedblright, etc.) not mapped to the right Unicode 
character
----------------------------------------------------------------------------------------

                 Key: PDFBOX-1129
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1129
             Project: PDFBox
          Issue Type: Bug
    Affects Versions: 1.7.0
            Reporter: Michael McCandless
            Priority: Minor
         Attachments: 000086.pdf

I have an example PDF (attached as 000086.pdf) that uses a
right-single-quote character, but PDFBox extracts it incorrectly
(using ExtractText). If I copy/paste the text instead, it is correct
(I get U+2019 for the right quote).

To see it, search for "cashier" on page 1 of the PDF; I think that
right quote is supposed to come through as U+2019.

I looked at the PDF in PDFDebugger, and I see this fragment in the
"Contents" for page 1:

  (Bring the voucher handout to the cashier\325s office \(10-180\))Tj

So somehow this \325 escape fails to map to the quoteright glyph.  The
font is a partially embedded font, BPOLKO+TimesNewRomanPSMT, and I can
see in the Charset (under FontDescriptor, for font F1) that it
references this glyph.

I also see a [correct] entry in glyphlist.txt, mapping to U+2019, so
that's not the problem.

Not sure what's going wrong... maybe somehow \325 fails to map to
quoteright? 
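
To make the expected mapping concrete, here is a minimal sketch of the
lookup chain I would expect: octal escape -> character code -> glyph
name -> Unicode. It assumes the font uses MacRomanEncoding, where octal
325 is quoteright; I have not confirmed the font's encoding in this PDF.
The class name, method name and the two small tables are hypothetical
illustrations, not PDFBox code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not PDFBox code: decodes the escapes in a PDF
// literal string, assuming the font uses MacRomanEncoding.
class QuoteEscapeSketch {
    // Partial table: a few MacRomanEncoding codes (octal) and glyph names.
    private static final Map<Integer, String> MAC_ROMAN_NAMES = new HashMap<>();
    // Partial table: the matching glyph-list entries (glyph name -> Unicode).
    private static final Map<String, Character> GLYPH_TO_UNICODE = new HashMap<>();

    static {
        MAC_ROMAN_NAMES.put(0322, "quotedblleft");
        MAC_ROMAN_NAMES.put(0323, "quotedblright");
        MAC_ROMAN_NAMES.put(0324, "quoteleft");
        MAC_ROMAN_NAMES.put(0325, "quoteright");
        GLYPH_TO_UNICODE.put("quotedblleft", '\u201C');
        GLYPH_TO_UNICODE.put("quotedblright", '\u201D');
        GLYPH_TO_UNICODE.put("quoteleft", '\u2018');
        GLYPH_TO_UNICODE.put("quoteright", '\u2019');
    }

    // Handles only octal escapes (1 to 3 digits) and literal escapes such
    // as backslash-paren; other PDF escapes are out of scope for this sketch.
    public static String decode(String raw) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c != '\\' || i + 1 >= raw.length()) {
                out.append(c);
                i++;
                continue;
            }
            char next = raw.charAt(i + 1);
            if (next < '0' || next > '7') {
                out.append(next); // escaped literal, e.g. a parenthesis
                i += 2;
                continue;
            }
            int code = 0;
            int j = i + 1;
            while (j < raw.length() && j < i + 4
                    && raw.charAt(j) >= '0' && raw.charAt(j) <= '7') {
                code = code * 8 + (raw.charAt(j) - '0');
                j++;
            }
            code &= 0xFF; // character codes are single bytes here
            String name = MAC_ROMAN_NAMES.get(code);
            Character mapped = (name == null) ? null : GLYPH_TO_UNICODE.get(name);
            // Fall back to the raw code when the partial tables have no entry.
            out.append(mapped != null ? mapped.charValue() : (char) code);
            i = j;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "Bring the voucher handout to the cashier\\325s office \\(10-180\\)";
        System.out.println(decode(raw));
    }
}
```

Under that assumption, decode() turns the \325 in the fragment above
into U+2019, which matches what copy/paste gives me.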

There are other glyphs (quotedblright, quotedblleft) that are also not
converted correctly; e.g., search for "project review" on page 2.

