[
https://issues.apache.org/jira/browse/PDFBOX-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniel Schwinn updated PDFBOX-1282:
-----------------------------------
Attachment: Characters_Arial.pdf
The PDF file with the unicode character table as example. the small texts are
with a standard font, the largewr letters are created with an unicode font.
> Unicode characters displayed with wrong glyps because of interpretation as 8
> bit strings
> ----------------------------------------------------------------------------------------
>
> Key: PDFBOX-1282
> URL: https://issues.apache.org/jira/browse/PDFBOX-1282
> Project: PDFBox
> Issue Type: Bug
> Components: PDFReader
> Affects Versions: 1.6.0
> Reporter: Daniel Schwinn
> Attachments: Characters_Arial.pdf, Characters_Arial_Modified.pdf
>
>
> the file Characters_Arial.pdf shows that some unicode values are displayed
> with wrong glyphs, for example the u2020 which is displayed as two spaces.
> Another Issue is that invalid unicode characters are not handled correctly.
> They should display
> the invalid character box or something like that. This is demonstrated with
> a modified version
> of the file.
> The method processEncodedText is called when the texts of the document are
> printed
> int codeLength = 1;
> for( int i=0; i<string.length; i+=codeLength)
> {
> // Decode the value to a Unicode character
> codeLength = 1;
> String c = font.encode( string, i, codeLength );
> if( c == null && i+1<string.length)
> {
> //maybe a multibyte encoding
> codeLength++;
> c = font.encode( string, i, codeLength );
> }
> This code tries to determine if the values in variable 'string' are 8 or 16
> bit values or even a mixture of both types of values <lol>.
> Everything works fine when variable 'string' contains 8 bit values, in most
> cases. If there is an invalid 8 bit value this character may be dropped
> together with the following character.
> The real problem occurs when the data in variable 'string' is encoded as 16
> bit values. For many characters this works fine as the first byte is usually
> not a valid character:
> for example u0041 is first tried as char 00 with codeLength=1 an as there is
> no entry for unicode 0 in the font it will be re-tried with codeLength=2 and
> then interpreted as u0041.
> But what happens if the first byte of the 16 bit code is also a valid
> character code?
> to check this I created the file Characters_Arial_Changed.pdf where I simply
> changed the 16-bit string <0041> which displays 'A' to <4141> which is an
> invalid character in this font. I Also changed a 8-bit string nearby from
> (0041) to the value <4141>.
> Note that there are now two strings with the same value <4141> which have to
> be displayed in a different way.
> Acrobat Reader then shows the invalid character box for the 16 bit string and
> 'AA' for the 8 bit string above. PDFBox shows 'AA' for both strings.
> Problems are occuring with valid unicode character codes too: Unicode u2020
> will be shown as two nice spaces in PDFBox where Adobe Reader shows the
> correct character.
> To guess that it is a 16 bit character when the first byte is an invalid
> character in the current font is the wrong way to handle the string values.
> If the variable 'string' contains 8 or 16 bit values can't be detected by
> analysing the values as the example shows.
> processEncodedText has to handle the data in variable 'string' as 16 bit
> values when the font which is used has an (unicode-)encoding which uses more
> than 256 characters, in all other cases it should be interpreted as 8 bit
> values!!!
> With an Unicode Font <4343> or (CC) should show the invalid character box,
> with an 8 bit font both values should show the text 'CC'. I have included
> this example in the file too.
> The Adobe documentation says about 8 or 16 bit values in strings for example:
> "When the current font is a Type 0 font whose Encoding entry is Identity-H or
> Identity-V, the string to be shown shall contain pairs of bytes representing
> CIDs, high-order byte first. When the current font is a CIDFont, the string
> to be shown shall contain pairs of bytes representing CIDs, high-order byte
> first. When the current font is a Type 2 CIDFont in which the CIDToGIDMap
> entry is Identity and if the TrueType font is embedded in the PDF file, the
> 2-byte CID values shall be identical glyph indices for the glyph descriptions
> in the TrueType font program."
> I guess depending on this information it has to be determined if the string
> is 8 or 16 bits!
> In my example pdf files the type 0 font has always the Indentity-H set as
> encoding and so the strings have to be en-/decoded as pure 16 bit strings.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira