RE: International characters in source files and SVN commit messages (was: RE:r1525975)

Konstantin Preißer Wed, 25 Sep 2013 08:37:38 -0700

Hi Mark,

thanks for the reply.


> -----Original Message-----
> From: Mark Thomas [mailto:ma...@apache.org]
> Sent: Wednesday, September 25, 2013 5:01 PM

> > One way I can
> > think would be to XML-encode such characters ("ß" as "&#xDF;").
> > However, personally I would rather not do this, but write such
> > characters directly ("ß"), so that the source is better readable (and
> > encodings like UTF-8 guarantee that the characters are interpreted
> > the same on each system, independently from the system language or
> > geographic location).
> 
> I don't like the idea of using XML encoding at all.

Just to avoid a misunderstanding: by "XML encoding", do you mean numeric 
character references like &#nnn;?


> > Could it be possible to change SVN Commit E-Mail system so that it
> > may interpret diffs as UTF-8 instead of ISO-8859-1 (assuming all
> > files which contain bytes > 0x7F are encoded as UTF-8)? (Or, that it
> > tries to decode it as UTF-8, and if it fails, decode it as ISO-8859-1
> > ?)
> 
> This is a question for infra. If UTF-8 fails then ISO-8859-1 is going to
> fail as well.

What I mean is guessing the character encoding: first try to decode the data 
as UTF-8 and, if that fails, assume it was encoded as ISO-8859-1/Windows-1252. 
Some programs seem to use this approach to decide whether a file without BOM 
bytes is encoded as UTF-8 or as ANSI.

For example, consider a file that contains only ASCII characters (<= 0x7F), 
stored as one byte per character. As UTF-8 is ASCII-compatible, you will get 
the same result whether you decode it as UTF-8 or as ISO-8859-1.

However, if a file contains "äöü" (German umlaut characters) encoded as 
ISO-8859-1 (bytes: E4 F6 FC), then UTF-8 decoding will fail: E4 starts with 
1110xxxx (binary) and so announces a multi-byte sequence, but the bytes that 
follow do not start with 10xxxxxx. Decoding as ISO-8859-1 will succeed.
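
A minimal sketch of this guessing approach in Python, covering the two 
examples above. The function name guess_decode is hypothetical, and this is 
not meant to show how the commit mailer or any of the programs mentioned here 
is actually implemented:

```python
def guess_decode(data: bytes) -> tuple:
    """Return (text, encoding): UTF-8 if the bytes are valid UTF-8, else ISO-8859-1."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a character, so this cannot fail.
        return data.decode("iso-8859-1"), "iso-8859-1"


# Pure ASCII: UTF-8 and ISO-8859-1 give the same text.
print(guess_decode(b"abc"))
# "äöü" stored as ISO-8859-1 (E4 F6 FC) is not valid UTF-8, so the fallback is used.
print(guess_decode(bytes([0xE4, 0xF6, 0xFC])))
# "äöü" stored as UTF-8 (C3 A4 C3 B6 C3 BC) decodes on the first attempt.
print(guess_decode("äöü".encode("utf-8")))
```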

This approach to guessing the encoding (UTF-8 vs. ISO-8859-1/Windows-1252) 
seems to be used by programs like Notepad++ when opening text files without a 
BOM, and by TortoiseSVN when displaying file changes. It seems to work well if 
your files are either UTF-8 or ISO-8859-1/Windows-1252 (or another local 
encoding). Of course, it will not always work, e.g. if a text file that is 
encoded with ISO-8859-1/Windows-1252 actually contains text like "ÃŸ", whose 
bytes (C3 9F in Windows-1252) are also valid UTF-8. (Personally, for my 
projects I use UTF-8 for everything :) )
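
The failure case can be reproduced in a few lines of Python. This is only an 
illustration; the variable names are made up:

```python
# "ÃŸ" encoded as Windows-1252 yields the two bytes C3 9F ...
raw = "ÃŸ".encode("cp1252")
print(raw.hex())

# ... which happen to be the valid UTF-8 encoding of "ß", so a UTF-8-first
# guess decodes them without error and returns the wrong text.
guessed = raw.decode("utf-8")
print(guessed)
```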


I was asking because I saw some i18n files like "LocalStrings_ja.properties" 
that encode non-ASCII characters with "\uXXXX", and I'd like to know if it is 
okay to put the "ß" character in the XML file directly, without encoding it as 
a numeric character reference, while the Commit E-Mails don't use UTF-8. If you 
are okay with this, then I don't mind changing the encoding for the SVN Commit 
E-Mails.
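
For comparison, here is a small Python sketch that produces both escaped 
forms of a string containing "ß". The helper names are hypothetical, and the 
"\uXXXX" helper only handles characters up to U+FFFF:

```python
def properties_escape(text: str) -> str:
    """Escape non-ASCII characters as \\uXXXX (BMP characters only in this sketch)."""
    return "".join(c if ord(c) < 0x80 else f"\\u{ord(c):04X}" for c in text)


def xml_char_ref(text: str) -> str:
    """Escape non-ASCII characters as hexadecimal numeric character references."""
    return "".join(c if ord(c) < 0x80 else f"&#x{ord(c):X};" for c in text)


print(properties_escape("Preißer"))  # Prei\u00DFer
print(xml_char_ref("Preißer"))       # Prei&#xDF;er
```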

Thanks!

Konstantin


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org
