Hi Mark, thanks for the reply.
> -----Original Message-----
> From: Mark Thomas [mailto:ma...@apache.org]
> Sent: Wednesday, September 25, 2013 5:01 PM
>
> > One way I can think would be to XML-encode such characters ("ß" as
> > "&#223;"). However, personally I would rather not do this, but write
> > such characters directly ("ß"), so that the source is better readable
> > (and encodings like UTF-8 guarantee that the characters are
> > interpreted the same on each system, independently of the system
> > language or geographic location).
>
> I don't like the idea of using XML encoding at all.

Just to avoid a misunderstanding: by "XML encoding", do you mean numeric character references like &#nnn;?

> > Could it be possible to change the SVN commit e-mail system so that
> > it interprets diffs as UTF-8 instead of ISO-8859-1 (assuming all
> > files which contain bytes > 0x7F are encoded as UTF-8)? (Or, that it
> > tries to decode them as UTF-8, and if that fails, decodes them as
> > ISO-8859-1?)
>
> This is a question for infra. If UTF-8 fails then ISO-8859-1 is going
> to fail as well.

I mean guessing the character encoding by first decoding the file as UTF-8 and, if that fails, assuming it was encoded as ISO-8859-1/Windows-1252.

For example, consider a file that contains only ASCII characters (< 0x7F) stored as one byte per character. As UTF-8 is ASCII-compatible, you get the same result whether you interpret it as UTF-8 or as ISO-8859-1. However, if a file contains "äöü" (German umlaut characters) as ISO-8859-1 (bytes: E4 F6 FC), then UTF-8 decoding will fail, because the bytes following a lead byte that starts with 11xxxxxx (binary) do not themselves start with 10xxxxxx; decoding as ISO-8859-1 will succeed.

This approach to guessing the encoding (UTF-8 vs. ISO-8859-1/Windows-1252) seems to be used by programs like Notepad++ when opening text files without a BOM, and by TortoiseSVN when displaying file changes, and it works well if your files are encoded either as UTF-8 or as ISO-8859-1/Windows-1252 (or other local encodings). Of course, it will not always work, e.g. if a text file encoded as ISO-8859-1 happens to contain a byte sequence that is also valid UTF-8. (Personally, for my projects I use UTF-8 for everything :) )

I was asking because I saw some i18n files like "LocalStrings_ja.properties" that encode non-ASCII characters with "\uXXXX", and I'd like to know whether it is okay to put a character like "ß" in the XML file without encoding it as a numeric character reference, while the commit e-mails don't use UTF-8. If you are okay with this, then I don't mind changing the encoding for the SVN commit e-mails.

Thanks!

Konstantin

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org
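P.S. The decode-as-UTF-8-first heuristic described above can be sketched in Java roughly as follows. This is only an illustration of the idea (the class and method names are made up, not actual Tomcat or ASF infra code): a strict UTF-8 decoder is tried first, and only if it reports malformed input do we fall back to ISO-8859-1, which accepts every byte sequence.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingGuess {

    // Hypothetical helper: try strict UTF-8 first; if decoding fails,
    // fall back to ISO-8859-1 (which can decode any byte sequence).
    static String decodeGuessing(byte[] bytes) {
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return utf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        // "äöü" stored as ISO-8859-1 is E4 F6 FC: invalid UTF-8
        // (F6 is not a 10xxxxxx continuation byte), so the
        // ISO-8859-1 fallback decodes it correctly.
        byte[] latin1 = { (byte) 0xE4, (byte) 0xF6, (byte) 0xFC };
        System.out.println(decodeGuessing(latin1));

        // The same text stored as UTF-8 decodes on the first attempt.
        byte[] utf8 = "äöü".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeGuessing(utf8));
    }
}
```

As noted above, the heuristic can misfire when an ISO-8859-1 byte sequence happens to be valid UTF-8, which is why it is only a guess and not a guarantee.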