[jira] [Created] (PDFBOX-1282) Unicode characters displayed with wrong glyphs because of interpretation as 8 bit strings

Daniel Schwinn (Created) (JIRA) Fri, 06 Apr 2012 14:52:41 -0700

Unicode characters displayed with wrong glyphs because of interpretation as 8 
bit strings
----------------------------------------------------------------------------------------


                 Key: PDFBOX-1282
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1282
             Project: PDFBox
          Issue Type: Bug
          Components: PDFReader
    Affects Versions: 1.6.0
            Reporter: Daniel Schwinn


The file Characters_Arial.pdf shows that some Unicode values are displayed 
with the wrong glyphs, for example u2020, which is displayed as two spaces. 
Another issue is that invalid Unicode characters are not handled correctly: 
they should be displayed as the invalid character box or something similar. 
This is demonstrated with a modified version of the file.

The method processEncodedText is called when the text of the document is 
printed:

        int codeLength = 1;
        for( int i=0; i<string.length; i+=codeLength)
        {
            // Decode the value to a Unicode character
            codeLength = 1;
            String c = font.encode( string, i, codeLength );
            if( c == null && i+1<string.length)
            {
                //maybe a multibyte encoding
                codeLength++;
                c = font.encode( string, i, codeLength );
            }
            // ... (remainder of the loop body omitted from this excerpt)
        }

This code tries to guess whether the values in the variable 'string' are 8-bit 
values, 16-bit values, or even a mixture of both.

In most cases everything works fine when the variable 'string' contains 8-bit 
values. If there is an invalid 8-bit value, however, that character may be 
dropped together with the following character.

The real problem occurs when the data in the variable 'string' is encoded as 
16-bit values. For many characters this works fine, because the first byte is 
usually not a valid character. For example, u0041 is first tried as char 00 
with codeLength=1, and as there is no entry for Unicode 0 in the font it is 
retried with codeLength=2 and then interpreted as u0041.

But what happens if the first byte of the 16 bit code is also a valid character 
code?

To check this I created the file Characters_Arial_Changed.pdf, in which I 
simply changed the 16-bit string <0041>, which displays 'A', to <4141>, which 
is an invalid character in this font. I also changed an 8-bit string nearby 
from (0041) to the value <4141>.

Note that there are now two strings with the same value <4141> which have to be 
displayed in a different way.

Acrobat Reader then shows the invalid character box for the 16 bit string and 
'AA' for the 8 bit string above. PDFBox shows 'AA' for both strings.

Problems also occur with valid Unicode character codes: Unicode u2020 is shown 
as two spaces in PDFBox, whereas Adobe Reader shows the correct character.
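
The behaviour described above can be reproduced with a small toy model. This 
is only a sketch, not PDFBox code: the class GuessingDecoderDemo and its 
encode method are hypothetical stand-ins. encode assumes that a 1-byte code is 
valid only for printable ASCII and that every 2-byte code is valid, which is 
simpler than a real font. The decode loop copies the control flow of the 
excerpt quoted earlier.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the guessing loop; names and the font table are assumptions.
final class GuessingDecoderDemo {

    // Hypothetical stand-in for font.encode(): a 1-byte code yields a character
    // only for printable ASCII; a 2-byte code is read high-order byte first.
    static String encode(byte[] data, int offset, int length) {
        if (length == 1) {
            int code = data[offset] & 0xFF;
            return (code >= 0x20 && code < 0x7F) ? String.valueOf((char) code) : null;
        }
        int code = ((data[offset] & 0xFF) << 8) | (data[offset + 1] & 0xFF);
        return String.valueOf((char) code);
    }

    // Same control flow as the quoted loop: try one byte, fall back to two.
    static List<String> decode(byte[] string) {
        List<String> out = new ArrayList<>();
        int codeLength = 1;
        for (int i = 0; i < string.length; i += codeLength) {
            codeLength = 1;
            String c = encode(string, i, codeLength);
            if (c == null && i + 1 < string.length) {
                codeLength++; // maybe a multibyte encoding
                c = encode(string, i, codeLength);
            }
            out.add(c);
        }
        return out;
    }

    public static void main(String[] args) {
        // <0041>: byte 00 is not a valid 1-byte code, so the pair is read as u0041.
        System.out.println(decode(new byte[] {0x00, 0x41}).size()); // 1
        // <4141>: byte 41 is a valid 1-byte code, so the pair is split into 'A', 'A'.
        System.out.println(decode(new byte[] {0x41, 0x41}).size()); // 2
        // <2020>: byte 20 is a space, so u2020 is split into two spaces.
        System.out.println(decode(new byte[] {0x20, 0x20}).size()); // 2
    }
}
```

In the same model the input <0141> yields a single element, because the 
invalid first byte swallows the 'A' that follows it. That matches the 8-bit 
case mentioned earlier, where an invalid value is dropped together with the 
following character.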

Guessing that a value is a 16-bit character because its first byte is an 
invalid character in the current font is the wrong way to handle the string 
values. As the example shows, whether the variable 'string' contains 8-bit or 
16-bit values cannot be detected by analysing the values themselves.

processEncodedText has to treat the data in the variable 'string' as 16-bit 
values when the font in use has a (Unicode) encoding with more than 256 
characters; in all other cases the data should be interpreted as 8-bit values.
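
A minimal sketch of that idea follows. It is not a patch for PDFBox: the class 
FixedWidthSplitter and the twoByteFont flag are hypothetical, and the flag is 
assumed to have been derived from the font beforehand. The point is only that 
the code length is fixed per font and never guessed from the byte values.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: splits a string into character codes of a fixed width.
final class FixedWidthSplitter {

    // twoByteFont is a hypothetical flag: true for a font whose strings carry
    // 2-byte codes, false for an ordinary 8-bit font.
    static List<Integer> splitCodes(byte[] string, boolean twoByteFont) {
        int codeLength = twoByteFont ? 2 : 1;
        List<Integer> codes = new ArrayList<>();
        // A trailing odd byte in a 2-byte string is ignored in this sketch.
        for (int i = 0; i + codeLength <= string.length; i += codeLength) {
            int code = 0;
            for (int j = 0; j < codeLength; j++) {
                code = (code << 8) | (string[i + j] & 0xFF); // high-order byte first
            }
            codes.add(code);
        }
        return codes;
    }

    public static void main(String[] args) {
        byte[] cc = {0x43, 0x43};
        System.out.println(splitCodes(cc, true));  // one code, 0x4343
        System.out.println(splitCodes(cc, false)); // two codes, 0x43 and 0x43
    }
}
```

Whether a resulting code maps to a glyph or to the invalid character box is 
then a separate lookup in the font, independent of how the bytes were split.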

With a Unicode font, <4343> or (CC) should show the invalid character box; 
with an 8-bit font both values should show the text 'CC'. I have included this 
example in the file too.

On 8-bit and 16-bit values in strings, the Adobe documentation says, for example:
"When the current font is a Type 0 font whose Encoding entry is Identity-H or 
Identity-V, the string to be shown shall contain pairs of bytes representing 
CIDs, high-order byte first. When the current font is a CIDFont, the string to 
be shown shall contain pairs of bytes representing CIDs, high-order byte first. 
When the current font is a Type 2 CIDFont in which the CIDToGIDMap entry is 
Identity and if the TrueType font is embedded in the PDF file, the 2-byte CID 
values shall be identical glyph indices for the glyph descriptions in the 
TrueType font program."

I guess this is the information that has to be used to determine whether the 
string consists of 8-bit or 16-bit values.

In my example PDF files the Type 0 font always has Identity-H set as its 
encoding, so the strings have to be encoded and decoded as pure 16-bit strings.
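
The rule quoted from the Adobe documentation could be turned into a code 
length decision roughly as follows. This is a sketch under stated assumptions: 
CodeLengthRule is a hypothetical class, and the font dictionary is modelled as 
a plain Map of name-valued entries (Subtype, Encoding) and not as a PDFBox 
object. Only the Identity-H and Identity-V case from the quotation is covered; 
Type 0 fonts with other encodings are deliberately rejected.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: derives the code length from font dictionary entries.
final class CodeLengthRule {

    // fontDict is a hypothetical, simplified stand-in for a font dictionary.
    static int codeLength(Map<String, String> fontDict) {
        if (!"Type0".equals(fontDict.get("Subtype"))) {
            return 1; // treated as an 8-bit font in this sketch
        }
        String encoding = fontDict.get("Encoding");
        if ("Identity-H".equals(encoding) || "Identity-V".equals(encoding)) {
            return 2; // pairs of bytes, high-order byte first
        }
        // Other encodings of Type 0 fonts are not modelled here.
        throw new UnsupportedOperationException("encoding not modelled: " + encoding);
    }

    public static void main(String[] args) {
        Map<String, String> composite = new HashMap<>();
        composite.put("Subtype", "Type0");
        composite.put("Encoding", "Identity-H");
        System.out.println(codeLength(composite)); // 2

        Map<String, String> simple = new HashMap<>();
        simple.put("Subtype", "TrueType");
        System.out.println(codeLength(simple)); // 1
    }
}
```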



