https://bz.apache.org/bugzilla/show_bug.cgi?id=57008
--- Comment #14 from Mark Murphy <[email protected]> --- (In reply to Matthias Gerth from comment #13) > I've written an escape function as a workaround. > https://stackoverflow.com/questions/48222502/xssfcell-in-apache-poi-encodes- > certain-character-sequences-as-unicode-character > > So my use case is this: I need to store a string containing "_x24B8_" into > an excel file. This is user input and I can not prevent this. The setValue > function on XSSFCell has one parameter of type java.lang.String. Java string > does not use microsoft encoding to represent unicode character. > > So this happens > 1. String value = "_x24B8_"; > 2. String valueEscaped = escape(value); // "_x005F_x24B8_" > 3. cell.setValue(valueEscaped) // cell.value is now "_x24B8_" > 4. once the file is written is changes back to "_x005F_x24B8_" in the file > > I think setValue should not call XSSFRichTextString.utfDecode(). This would > prevent this back and forth encoding. > We could also make XSSFRichTextString.utfDecode() a public for people who > are using this type of encoding. I would prefer this microsoft encoding > terminated within the library since it is specific to the office file format. In my mind, this is how it should work: cell.setValue() should perform a utfEscape() (this does not currently exist) which will add "_x005F" to anything that looks like "_x####_" but not to things that look like "_x005F_x####_". conversely cell.getRichTextString() should perform a utfUnescape() (this does not exist) which will strip the "_x005F" from anything that looks like "_x005F_x####_". But we need to make sure that when reading a file, or processing XML, these Unicode special characters remain unchanged. Section 3.18.96 in Office Open XML 1st Edition Part 4 indicates that Unicode characters that cannot be in an XML document need to be encoded as "_x####_". I was unable to find out just which characters those were, so I don't know about encoding and decoding these characters. Right now it looks like we just decode "_x####_" to the Unicode character on read. So If I use your example above: 1. cell.setValue("_x24B8_"); // resulting XML should contain "_x005F_x24B8_" 2. cell.getRichTextString(); // should return "_x24B8_" For a file read in that contains "_x24B8_" in a cell: 3. cell.getRechTextString(): // should return "Ⓒ" - utfDecode() does this correctly. For a file read in that contains "_x005F_x24B8_" in a cell: 3. cell.getRechTextString(): // should return "_x24B8_" -- You are receiving this mail because: You are the assignee for the bug. --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
