[ 
https://issues.apache.org/jira/browse/PDFBOX-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631408#comment-13631408
 ] 

Luis Bernardo commented on PDFBOX-833:
--------------------------------------

I think I may have hit this bug and related ones (character codes mapped to 
the wrong glyphs with Type1C fonts). I investigated and came up with a fix, 
which I am attaching. The issue was complicated by what appears to be a bug 
in the Sun JDK's native code (used by StandardGlyphVector) that resulted in 
character codes being mapped to the missing glyph. To get around this apparent 
bug, the patch includes a workaround that does the mapping directly (this 
requires the use of reflection). In OpenJDK, which uses Freetype for the 
native code, this particular bug does not occur, but there is another issue 
that this patch also addresses (Freetype expects uppercase hexadecimal names).

I am also attaching part of the sample document that I used to investigate 
this, together with image outputs from before and after the patch is applied, 
which show what the issue is. 
                
> Wrong encoding with Type1C font when specific encoding is defined
> -----------------------------------------------------------------
>
>                 Key: PDFBOX-833
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-833
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.3.1
>            Reporter: Timo Boehme
>
> The Type1C font implementation overrides the encode() method of the PDFont 
> base class. As a result, codes are mapped to characters as defined in the 
> font.
> However, if an encoding is explicitly given (like WinAnsiEncoding), this 
> leads to wrong results when encoding codes do not match glyph codes.
> In a test document (which unfortunately I cannot make public - an article 
> from Elsevier) a Type1C font is embedded which defines a copyright sign at 
> glyph position 259. The encoding is defined as WinAnsiEncoding. Text 
> characters are defined according to the WinAnsiEncoding. In the case of the 
> copyright sign it is 0xa9 (169), where the font has glyph 'quotesingle' 
> defined.
> Since I currently have no other test cases, I implemented the following 
> workaround for WinAnsiEncoding (which might be extended to other PDF 
> encodings as well):
> in PDType1CFont.encode() I start with:
> if ( getEncoding() instanceof WinAnsiEncoding )
>   // use PDFont encoding
>   return super.encode( bytes, offset, length );
> This resolves the encoding problems for text extraction.
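
The mismatch described in the issue can be illustrated with a small 
self-contained sketch. This is not PDFBox code: the class, the two tables and 
glyphNameFor() are hypothetical stand-ins, and only the two entries for code 
0xa9 are taken from the report. It assumes the fix amounts to preferring the 
explicitly declared encoding over the font's built-in one.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names are PDFBox API.
final class EncodingLookupSketch {
    // Stand-in for the code-to-glyph-name table built into the font.
    private static final Map<Integer, String> FONT_BUILT_IN = new HashMap<>();
    // Stand-in for the explicitly declared encoding (WinAnsiEncoding excerpt).
    private static final Map<Integer, String> WIN_ANSI = new HashMap<>();

    static {
        FONT_BUILT_IN.put(0xA9, "quotesingle"); // what the font defines at 169
        WIN_ANSI.put(0xA9, "copyright");        // what WinAnsiEncoding defines at 169
    }

    // Resolve a character code to a glyph name, using the explicit encoding
    // when one is declared and the font's own table otherwise.
    static String glyphNameFor(int code, boolean explicitEncoding) {
        Map<Integer, String> table = explicitEncoding ? WIN_ANSI : FONT_BUILT_IN;
        return table.getOrDefault(code, ".notdef");
    }

    public static void main(String[] args) {
        System.out.println(glyphNameFor(0xA9, false)); // font's own table
        System.out.println(glyphNameFor(0xA9, true));  // explicit encoding
    }
}
```

With the font's own table, code 0xa9 resolves to 'quotesingle'; with the 
explicit encoding it resolves to the copyright sign, which is what the text 
in the document intends.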

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
