This is an admittedly uncommon issue, but I encounter a problem from time to
time involving bad character values in xls files.

According to the Excel97 format, text data is supposed to be in Unicode.
ISO-8859-1 (Latin-1) assigns no printable characters to the 32 values from
0x80 to 0x9F (they are reserved for control codes), and of course the same
is true for Unicode, which is a superset of Latin-1.

The Windows platform, though, has long used an 8 bit encoding called
windows-1252 (the IANA official designation, also called codepage-1252),
which extended Latin-1 by assigning 27 values in this range to symbols that
have a 16 bit representation in Unicode (see
http://www.microsoft.com/globaldev/reference/sbcs/1252.htm for the spec).
See also http://www.iana.org/assignments/character-sets and
http://www.iana.org/assignments/charset-reg/windows-1252 for reference.

Apparently some versions of Microsoft software (software and OS versions
unknown) make it possible to generate Excel97 files containing codepage-1252
character values. This causes problems when the data is passed on to other
processes. At the very least, the intended symbol is lost (replaced by a
"?"), but the invalid values can also be passed on (depending on whether the
text string contained only 8 bit encodings or not) to cause mischief
elsewhere.

I encounter this in an environment where we import xls files submitted by
customers through the internet, which leads to a very mixed bag of files
showing up.

I propose a patch to:
org.apache.poi.hssf.record.UnicodeString.fillFields 
where the characters are actually copied from the byte buffer to
field_3_string.

In the case of 8 bit strings (the grbit is 0) the patch would be to use the
encoding "Cp1252" instead of "ISO-8859-1" as the default. Since windows-1252,
as defined, is a proper superset of Latin-1, this should always work. Pure
Latin-1 text will be properly translated, and so will windows-1252 text (the
27 symbols not found in Latin-1 will become Unicode values with a non-zero
high byte).
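
As a rough illustration of the 8 bit case (a minimal sketch, not the actual
POI code): the class and method names below are hypothetical, and it assumes
a JDK with java.nio.charset, where "windows-1252" is the canonical name of
the charset the older String constructor knows as "Cp1252".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of POI.
final class Cp1252Decode {
    // "windows-1252" is the canonical name; "Cp1252" is an alias for it.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Decodes an 8 bit string from the record buffer as windows-1252.
    static String decode8Bit(byte[] data, int offset, int length) {
        return new String(data, offset, length, CP1252);
    }

    public static void main(String[] args) {
        // 'a', left double quote (0x93), 'b', right double quote (0x94), e-acute
        byte[] raw = { 'a', (byte) 0x93, 'b', (byte) 0x94, (byte) 0xE9 };

        String latin1 = new String(raw, 0, raw.length, StandardCharsets.ISO_8859_1);
        String cp1252 = decode8Bit(raw, 0, raw.length);

        // Latin-1 passes the raw value through; Cp1252 yields the real symbol.
        System.out.println(Integer.toHexString(latin1.charAt(1))); // 93
        System.out.println(Integer.toHexString(cp1252.charAt(1))); // 201c
    }
}
```

Bytes outside 0x80-0x9F, such as 0xE9 above, decode to the same character
under either encoding.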

Alternatively, the 8 bit string could be scanned for bytes in the range
0x80-0x9F and, if any are found, the String constructor would be passed the
encoding "Cp1252" instead of "ISO-8859-1". The only reason for doing it this
way would be if one theorized that deviant implementations existed on some
version of Java, and sought to reduce exposure to them.
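
The scanning alternative might look something like this. Again this is only
a sketch under the same assumptions; the class and method names are
hypothetical and do not exist in POI.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of POI.
final class EightBitCharsetChooser {
    // True if any byte in the given range falls in 0x80-0x9F.
    static boolean hasC1Bytes(byte[] data, int offset, int length) {
        for (int i = offset; i < offset + length; i++) {
            int b = data[i] & 0xFF; // bytes are signed in Java
            if (b >= 0x80 && b <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    // Uses windows-1252 only when the string actually needs it.
    static String decode(byte[] data, int offset, int length) {
        Charset cs = hasC1Bytes(data, offset, length)
                ? Charset.forName("windows-1252")
                : StandardCharsets.ISO_8859_1;
        return new String(data, offset, length, cs);
    }
}
```

The extra pass over the bytes is the cost of keeping pure Latin-1 strings on
the existing code path.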

For the 16 bit strings the patch would be to filter the characters as the
copy is done, explicitly translating any of the 27 characters that are found.
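
A sketch of such a filter follows. The table holds the 27 windows-1252
assignments for 0x80-0x9F; the five unassigned values (0x81, 0x8D, 0x8F,
0x90, 0x9D) are left untouched. The class and method names are hypothetical,
and leaving the unassigned values alone is my assumption about the desired
behavior.

```java
// Hypothetical helper for illustration only; not part of POI.
final class Cp1252CharFilter {
    // Unicode equivalents for 0x80-0x9F; 0 marks a value windows-1252
    // leaves unassigned.
    private static final char[] C1_TO_UNICODE = {
        0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, // 0x80-0x87
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,      // 0x88-0x8F
        0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014, // 0x90-0x97
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178  // 0x98-0x9F
    };

    // Translates one character; anything outside the table passes through.
    static char translate(char c) {
        if (c >= 0x80 && c <= 0x9F) {
            char mapped = C1_TO_UNICODE[c - 0x80];
            if (mapped != 0) {
                return mapped;
            }
        }
        return c;
    }

    // Applies translate() to every character of a 16 bit string.
    static String filter(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = translate(chars[i]);
        }
        return new String(chars);
    }
}
```

In the real patch the translation would happen inline while the characters
are copied out of the byte buffer, rather than as a second pass over a
String.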

There shouldn't be any issues going the other way, that is, writing Excel97
documents. Since Unicode is the documented standard (according to pg. 264 of
the Excel97 Developer's Kit) there would never be a need to generate
windows-1252 output.

I am implementing my recommended patch right now for my own use, and will
submit it for incorporation into POI if the team is interested.

Carey Sublette

