In preparation for a short talk I'm to give in Portland next week about how different operating systems, filesystems, and languages (including but not limited to regexes) handle Unicode, I got to thinking about normalization issues. And I think I've found a Java compiler bug, or at best, an infelicity in a grey area.
It's no consolation, but Perl has exactly the same problem (well, pair of
problems) as Java has here. We do the same thing as Java, which I think is
the Wrong Thing, and we are also at the mercy of our filesystem for mapping
of classnames to filesystem objects, which is even worse.
I would like someone to tell me why Java shouldn't be fixed to cope with
these matters, both as internal identifiers and as those that exist outside
Java proper, in the filesystem (classnames).
After reading about the differences between how Apple and
Sun did normalization in the filesystem:
http://developers.sun.com/global/products_platforms/solaris/reference/presentations/IUC29-FileSystems.pdf
I wondered what impact this might/must have on Java. After all,
classnames must map to filesystem entries, and therefore if the system
is doing any kind of normalization, you're going to have Issues. Apple
runs everything through NFD (well, nearly), whereas the Sun paper cited
above, from about five years ago, says that they plan to do something
analogous to how "case-preserving but case-insensitive" filesystems
behave: that is, they'll let you create anything you want, but won't let
you create a new entry in a directory if it is canonically equivalent to
one that is already there.
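To make that second policy concrete, here is a minimal sketch of how I
read it. This is a toy in-memory model, not how any real filesystem is
implemented; the class name EquivalenceAwareDirectory and its methods
are hypothetical, and using NFC as the comparison key is my own
assumption (NFD would serve equally well). The only library call it
relies on is java.text.Normalizer.normalize.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of a directory that preserves names exactly as
// created, but rejects a new name that is canonically equivalent to
// an existing one.
class EquivalenceAwareDirectory {
    // Key: NFC form of the name (an assumption of this sketch).
    // Value: the name exactly as it was first created.
    private final Map<String, String> entries = new LinkedHashMap<>();

    /** Returns true if the entry was created, false if an equivalent one exists. */
    boolean create(String name) {
        String key = Normalizer.normalize(name, Normalizer.Form.NFC);
        return entries.putIfAbsent(key, name) == null;
    }

    /** Finds an entry under any canonically equivalent spelling; null if absent. */
    String lookup(String name) {
        return entries.get(Normalizer.normalize(name, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        EquivalenceAwareDirectory dir = new EquivalenceAwareDirectory();
        String nfc = "\u00E9l\u00E8ve.class";     // precomposed accents
        String nfd = "e\u0301le\u0300ve.class";   // combining accents

        System.out.println(dir.create(nfc));              // true
        System.out.println(dir.create(nfd));              // false
        System.out.println(nfc.equals(dir.lookup(nfd)));  // true
    }
}
```

The names are written with escapes rather than literal accented
letters so that nothing between my editor and yours can renormalize
them.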
Before I went so far as to test this on Apple and Sun machines, let
alone others, I thought I would just try my test on local variables
instead. I have now tested it on Sun, Apple, and Linux, including
various versions of the compiler, and they all report the same thing.
And the thing they report, I feel, is wrong, because I know that it
will not work this way for class names the way it does for local variables.
I will include the source code twice, once as plain text so you can
read it, and once as an octet stream lest a "helpful" mailer decides
it should be normalizing things that pass through it, an evil that the
Apple mouse will do to you, believe it or not.
If you put this wicked file in a file called "nftest.java" and run
this command:
$ javac -encoding UTF-8 nftest.java && java nftest
Then you will get this output:
élève is 1.
élève is 2.
élève is 3.
élève is 4.
Those probably look the same. Running them through `uniquote -x` shows,
though, that they are not:
\x{E9}l\x{E8}ve is 1.
e\x{301}le\x{300}ve is 2.
\x{E9}le\x{300}ve is 3.
e\x{301}l\x{E8}ve is 4.
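If you don't have uniquote handy, the same view can be approximated in
Java. The sketch below is not uniquote itself, only an imitation of the
\x{...} output shown above; the class name HexEscape is hypothetical,
and it assumes Java 8 or later for String.codePoints().

```java
// Hypothetical stand-in for `uniquote -x`: ASCII passes through,
// every other code point is shown as \x{HEX}.
class HexEscape {
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                // Uppercase hex with no padding, matching the output above.
                out.append(String.format("\\x{%X}", cp));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String[] names = {
            "\u00E9l\u00E8ve",      // NFC
            "e\u0301le\u0300ve",    // NFD
            "\u00E9le\u0300ve",     // mixed
            "e\u0301l\u00E8ve",     // mixed
        };
        for (int i = 0; i < names.length; i++) {
            System.out.println(escape(names[i]) + " is " + (i + 1) + ".");
        }
    }
}
```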
See the difference? Those are variable names, and I do not think Java
should permit duplicate variable names that differ only in normalization,
since it obviously cannot be permitted to do so for classnames, and it
feels hackish to have different identifier rules for classnames than for
other identifiers.
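The same four spellings can be compared at run time, outside the
compiler. This is only a sketch of the distinction I am drawing, not a
statement about what javac does internally; the class name
NameCollisionCheck is hypothetical, and picking NFC as the common form
is an assumption.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical check: how many distinct names are there, compared raw
// versus compared after normalization?
class NameCollisionCheck {
    static final String[] NAMES = {
        "\u00E9l\u00E8ve",      // NFC,   5 chars
        "e\u0301le\u0300ve",    // NFD,   7 chars
        "\u00E9le\u0300ve",     // mixed, 6 chars
        "e\u0301l\u00E8ve",     // mixed, 6 chars
    };

    /** Distinct names when compared code unit by code unit. */
    static int distinctRaw() {
        return new HashSet<>(Arrays.asList(NAMES)).size();
    }

    /** Distinct names after normalizing each one to NFC. */
    static int distinctNfc() {
        Set<String> seen = new HashSet<>();
        for (String name : NAMES) {
            seen.add(Normalizer.normalize(name, Normalizer.Form.NFC));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("raw: " + distinctRaw());   // raw: 4
        System.out.println("NFC: " + distinctNfc());   // NFC: 1
        // Combining marks are legal inside a Java identifier, which is
        // why all four spellings are accepted as names in the first place.
        System.out.println(Character.isJavaIdentifierPart(0x0301));  // true
    }
}
```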
Is this a bug? If so, are there plans to address it?
And what about the filesystem?
I am unaware of any document in The Unicode Standard that addresses
either of these issues; if any such exist, kindly point me at them.
My hunch is that these two problems, even though they are entirely
a consequence of Unicode, lie beyond the proper purview of The Unicode
Standard itself. But that doesn't absolve us from solving them.
Has this been previously discussed, and if so, what if any decision
was made regarding these two interrelated problems?
Thank you very much.
--tom
PS: The MIME contents of this message are as follows:
  msg  part  type/subtype              size  description
    1        multipart/mixed           8904
         1   text/plain                4071  a letter from tchrist
         2   application/octet-stream  1560  the nftest(-v1).java program as octets
                                             name="nftest-v1.java"
                                             filename="nftest-v1.java"
         3   text/plain                1560  the nftest(-v2).java program as plain text
nftest-v1.java
Description: the nftest(-v1).java program as octets
/*
 * nftest.java
 * Tom Christiansen <[email protected]>
 * Tue Jul 19 08:13:29 MDT 2011
 *
 * This tests whether Java normalizes its variable names.
 * We will use four different canonically equivalent strings,
 * and see if we can get four different answers, or a compilation
 * error.
 *
 *   N  String  As a literal            Graphemes  Chars  Norm?
 *   =============================================================
 *   1  élève   "\x{E9}l\x{E8}ve"           5        5    NFC
 *   2  élève   "e\x{301}le\x{300}ve"       5        7    NFD
 *   3  élève   "\x{E9}le\x{300}ve"         5        6    mixed
 *   4  élève   "e\x{301}l\x{E8}ve"         5        6    mixed
 */

import java.io.*;

public class nftest {
    static PrintStream stdout;

    public static void main(String argv[]) throws IOException {
        int élève = 1;   // "\x{E9}l\x{E8}ve"      NFC
        int élève = 2;   // "e\x{301}le\x{300}ve"  NFD
        int élève = 3;   // "\x{E9}le\x{300}ve"    mixed
        int élève = 4;   // "e\x{301}l\x{E8}ve"    mixed

        stdout = new PrintStream(System.out, true, "UTF-8");

        stdout.printf("%s is %d.\n", "élève", élève);   // "\x{E9}l\x{E8}ve"      NFC
        stdout.printf("%s is %d.\n", "élève", élève);   // "e\x{301}le\x{300}ve"  NFD
        stdout.printf("%s is %d.\n", "élève", élève);   // "\x{E9}le\x{300}ve"    mixed
        stdout.printf("%s is %d.\n", "élève", élève);   // "e\x{301}l\x{E8}ve"    mixed
    }
}
