[jira] Created: (PDFBOX-833) Wrong encoding with Type1C font when specific encoding is defined

Timo Boehme (JIRA) Mon, 20 Sep 2010 06:41:02 -0700

Wrong encoding with Type1C font when specific encoding is defined
-----------------------------------------------------------------


                 Key: PDFBOX-833
                 URL: https://issues.apache.org/jira/browse/PDFBOX-833
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 1.3.0
            Reporter: Timo Boehme


The Type1C font implementation overrides the encode() method of the PDFont base
class. This results in a lookup of codes to characters as defined in the font.
However, if an encoding is explicitly given (like WinAnsiEncoding), this leads
to wrong results if the encoding's codes do not match the font's glyph codes.
In a test document (which unfortunately I cannot make public, as it is an
article from Elsevier) a Type1C font is embedded which defines a copyright sign
at glyph position 259. The encoding is defined as WinAnsiEncoding. Text
characters are encoded according to WinAnsiEncoding. The copyright sign is
therefore 0xa9 (169), a code at which the font has the glyph 'quotesingle'
defined.
Since I currently have no other test cases, I implemented the following
workaround for WinAnsiEncoding (which might be extended to other PDF encodings
as well): in PDType1CFont.encode() I start with:

if ( getEncoding() instanceof WinAnsiEncoding )
{
  // an explicit encoding is given: use the PDFont implementation
  return super.encode( bytes, offset, length );
}

This resolves the encoding problems for text extraction.
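
To illustrate the mismatch described above, here is a minimal, self-contained
sketch. It does not use PDFBox classes. The class name EncodingLookupSketch,
the two tables, and the glyphName() helper are hypothetical and exist only for
this example. The sketch assumes that an explicitly given encoding should take
precedence over the code-to-glyph table built into the font.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: names and tables are hypothetical, not PDFBox API.
class EncodingLookupSketch
{
    // Code-to-glyph-name table as it might be defined inside the embedded font.
    static final Map<Integer, String> FONT_BUILT_IN = new HashMap<Integer, String>();

    // Small excerpt of the code-to-glyph-name table of WinAnsiEncoding.
    static final Map<Integer, String> WIN_ANSI = new HashMap<Integer, String>();

    static
    {
        FONT_BUILT_IN.put( 0xa9, "quotesingle" ); // what the reported font defines at 0xa9
        WIN_ANSI.put( 0xa9, "copyright" );        // what WinAnsiEncoding defines at 0xa9
        WIN_ANSI.put( 0x27, "quotesingle" );
    }

    // Resolve a character code to a glyph name. An explicit encoding, if
    // given and if it covers the code, wins over the font's built-in table.
    static String glyphName( int code, Map<Integer, String> explicitEncoding )
    {
        if ( explicitEncoding != null && explicitEncoding.containsKey( code ) )
        {
            return explicitEncoding.get( code );
        }
        return FONT_BUILT_IN.get( code );
    }
}
```

With the explicit table, glyphName( 0xa9, EncodingLookupSketch.WIN_ANSI )
yields "copyright". With no explicit encoding, the same code falls back to the
font's own table and yields "quotesingle", which is the wrong result seen
during text extraction.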

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
