On Mon, Oct 23, 2006 at 05:50:03PM +0200, Edwin Steiner wrote:
> I checked what the reference implementation does, using the attached
> program: The RI always interprets the filenames it gets from the system
> as latin1 (or similar), independent of the file.encoding property, it
> seems. This has the following consequences:

I conducted further tests (the attached shell scripts help to create
files with a latin1-encoded and a utf-8-encoded name, respectively).
What the RI does depends on the setting of the LANG variable.
For LANG=C
* in the latin1 filename e4 gets replaced by the replacement
character U+fffd, encoded as ef bf bd in UTF-8 output
(see http://www.fileformat.info/info/unicode/char/fffd/index.htm).
* in the utf-8 filename c3 a4 gets replaced by _two_
replacement characters: U+fffd U+fffd.
For LANG=en_US.UTF-8
* in the latin1 filename e4 gets replaced by the replacement
character U+fffd (e4 on its own is not a valid UTF-8 sequence).
* the utf-8 filename is read correctly (c3 a4 becomes U+00e4).
For LANG=en_US.iso88591
* the latin1 filename is read correctly (e4 becomes U+00e4).
* the utf-8 filename is read as latin1 (c3 a4 becomes the _two_
characters U+00c3 and U+00a4).
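
For anyone who wants to reproduce the table above without a JVM, here is a
minimal sketch in Python. It assumes that the RI simply decodes the raw
filename bytes with the charset implied by LANG and substitutes U+fffd for
anything it cannot decode; I have not verified that this is what the RI does
internally. The LANG-to-charset mapping and the helper names (decode_name,
code_points) are made up for illustration.

```python
# Assumed mapping from the LANG values tested above to Python codec names.
# This is illustrative only; the RI chooses its charset internally.
LANG_TO_CHARSET = {
    "C": "ascii",
    "en_US.UTF-8": "utf-8",
    "en_US.iso88591": "latin-1",
}

LATIN1_NAME = b"\xe4"      # U+00e4 encoded as latin1
UTF8_NAME = b"\xc3\xa4"    # U+00e4 encoded as UTF-8


def decode_name(raw, lang):
    """Decode raw filename bytes, replacing undecodable bytes with U+fffd."""
    return raw.decode(LANG_TO_CHARSET[lang], errors="replace")


def code_points(text):
    """Format a string as a space-separated list of U+xxxx code points."""
    return " ".join("U+%04x" % ord(ch) for ch in text)


if __name__ == "__main__":
    for lang in LANG_TO_CHARSET:
        for label, raw in (("latin1", LATIN1_NAME), ("utf-8", UTF8_NAME)):
            decoded = decode_name(raw, lang)
            print("LANG=%-15s %-6s name -> %s" % (lang, label, code_points(decoded)))
```

Under these assumptions the sketch gives the same code points as the six
cases listed above.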
Aren't encodings fun? ;)
-Edwin
t-latin1.sh
Description: Bourne shell script
t-utf8.sh
Description: Bourne shell script

