This is an admittedly uncommon issue, but I encounter a problem from time to time involving bad character values in xls files.
According to the Excel97 format, text data is supposed to be in Unicode. There are 32 values in ISO-8859-1 (Latin-1) that are invalid, from 0x80 to 0x9F, and of course the same is true for the superset of Latin-1 which is Unicode. The Windows platform though has long used an 8 bit encoding called windows-1252 (the IANA official designation, also called codepage-1252), which extended Latin-1 by assigning 27 values in this range to symbols that have a 16 bit representation in Unicode (see http://www.microsoft.com/globaldev/reference/sbcs/1252.htm for the spec). See also http://www.iana.org/assignments/character-sets and http://www.iana.org/assignments/charset-reg/windows-1252 for reference. Apparently some versions of Microsoft software (software and OS versions unknown) make it possible to generate Excel97 files containing codepage-1252 character values. This causes problems when the data is passed on to other processes. At the very least, the intended symbol is lost (replaced by a "?") but it can also pass the invalid values on (depending of whether the text string contained only 8 bit encodings or not) to cause mischief elsewhere. I encounter this in an environment where we import xls files submitted by customers through the internet, which leads to a very mixed bag of files showing up. I propose a patch to: org.apache.poi.hssf.record.UnicodeString.fillFields where the characters are actually copied from the byte buffer to field_3_string. In the case of 8 bit strings (the grbit is 0) the patch would be to use the encoding "Cp1252" instead of "ISO-8859-1" as the default. Since as defined windows-1252 is a proper superset of Latin-1 this should always work. Pure Latin-1 text will be properly translated, and so will windows-1252 (the 27 symbols not found in Latin-1 will become Unicode values with a non-zero high byte). Alternatively, the 8 bit string could be scanned for bytes in the range 0x80-0x9F and if found the String constructor would be passed the encoding "Cp1252" instead of "ISO-8859-1". The only reason for doing it this way is would be if one theorized that deviant implementations on some version of Java existed, and sought to reduce exposure to this. For the 16 bit strings the patch would be to filter the characters as the copy it done to explicitly translate the 27 characters if found. There shouldn't be any issues going the other way, that is, writing Excel97 documents. Since Unicode is the documented standard (according to pg. 264 of the Excel97 Developer's Kit) there would never be a need to generate windows-1252 output. I am implementing my recommended patch right now for my own use, and will submit it for incorporation into Poi if you the team is interested. Carey Sublette --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
