Tom,

JLS 3.8, "Identifiers" [1], states:

"Two identifiers are the same only if they are identical, that is, have the same Unicode character
for each letter or digit.

Identifiers that have the same external appearance may yet be different. For example, the identifiers consisting of the single letters LATIN CAPITAL LETTER A (|A|, |\u0041|), LATIN SMALL LETTER A (|a|, |\u0061|), GREEK CAPITAL LETTER ALPHA (|A|, |\u0391|), CYRILLIC SMALL LETTER A (|a|, |\u0430|) and MATHEMATICAL BOLD ITALIC SMALL A (|a|, |\ud835\udc82|)
are all different.

Unicode composite characters are different from the decomposed characters. For example, a LATIN CAPITAL LETTER A ACUTE (|Á|, |\u00c1|) could be considered to be the same as a LATIN CAPITAL LETTER A (|A|, |\u0041|) immediately followed by a NON-SPACING ACUTE
(´, |\u0301|) when sorting, but these are different in identifiers."

We happened to have a short discussion about this section a couple of days ago (Alex is working on the latest JLS, and we were discussing the possibility of rewording this section a little). So, at least for now, those are NOT duplicate identifiers according to the Java Language Specification. It might still be an implementation issue, though, if the file system treats such names differently, as you suggested.
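
To make the quoted rule concrete, here is a minimal sketch. It assumes java.text.Normalizer (available since Java 6); the class name IdentifierEquivalence and its helper method are made up for illustration and are not anything the compiler itself uses. It contrasts the character-by-character comparison that JLS 3.8 describes with a comparison after canonical normalization.

```java
import java.text.Normalizer;

public class IdentifierEquivalence {
    // Hypothetical helper: true if the two names are equal after NFC normalization.
    public static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00c1";     // LATIN CAPITAL LETTER A ACUTE
        String decomposed = "A\u0301";  // LATIN CAPITAL LETTER A plus a combining acute accent

        // Character-by-character comparison, as JLS 3.8 describes for identifiers.
        System.out.println(composed.equals(decomposed));                 // prints false
        // Comparison after normalization, which the JLS does not perform.
        System.out.println(canonicallyEquivalent(composed, decomposed)); // prints true
    }
}
```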

-Sherman

[1] http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.8


On 7/19/2011 7:52 AM, Tom Christiansen wrote:
In preparation for a short talk I'm to give in Portland next week about how
different operating systems, filesystems, and languages (including but not
limited to regexes) handle Unicode, I got to thinking about normalization
issues.  And I think I've found a Java compiler bug, or at best, an
infelicity in a grey area.

It's no consolation, but Perl has exactly the same problem (well, pair of
problems) as Java has here.  We do the same thing as Java, which I think is
the Wrong Thing, and we are also at the mercy of our filesystem for mapping
of classnames to filesystem objects, which is even worse.

I would like someone to tell me why Java shouldn't be fixed to cope with
these matters, both as internal identifiers and as those that exist outside
Java proper, in the filesystem (classnames).

After reading about the differences between how Apple and
Sun did normalization in the filesystem:

     
http://developers.sun.com/global/products_platforms/solaris/reference/presentations/IUC29-FileSystems.pdf

I wondered what impact this might/must have on Java.  After all,
classnames must map to filesystem entries, and therefore if the system
is doing any kind of normalization, you're going to have Issues.  Apple
runs everything through NFD (well, nearly), whereas the Sun paper cited
from about five years ago says that they plan to do something analogous
to how "case-preserving but case-insensitive" filesystems behave: that is,
they'll let you create anything you want, but won't let you create a new
entry in the same directory if they are canonically equivalent.
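
The rule described in that paper can be pictured with a small in-memory sketch. This is only an illustration under my own assumptions, not how any real filesystem implements it; the class name EquivalentEntryCheck and the method clashes are hypothetical, and the only library call used is java.text.Normalizer.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.List;

public class EquivalentEntryCheck {
    // Hypothetical check: would `candidate` collide with an existing entry
    // if the directory refused canonically equivalent names?
    public static boolean clashes(Iterable<String> existing, String candidate) {
        String key = Normalizer.normalize(candidate, Normalizer.Form.NFD);
        for (String name : existing) {
            if (Normalizer.normalize(name, Normalizer.Form.NFD).equals(key)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The directory already holds the precomposed spelling, stored as given.
        List<String> directory = Arrays.asList("\u00e9l\u00e8ve.class");

        // The fully decomposed spelling is canonically equivalent, so it clashes.
        System.out.println(clashes(directory, "e\u0301le\u0300ve.class")); // prints true
        // An unaccented name is a different name altogether.
        System.out.println(clashes(directory, "eleve.class"));             // prints false
    }
}
```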

Before I went so far as to test this on Apple and Sun machines, let
alone others, I thought I would just try my test on local variables
instead.  I have now tested it on Sun, Apple, and Linux, including
various versions of the compiler, and they all report the same thing.
And the thing they report, I feel, is wrong, because I know that it
will not work this way for class names the way it will for local variables.

I will include the source code twice, once as plain text so you can
read it, and once as an octet stream lest a "helpful" mailer decides
it should be normalizing things that pass through it, an evil that the
Apple mouse will do to you believe it or not.

If you put this wicked file in a file called "nftest.java" and run
this command:

     $ javac -encoding UTF-8 nftest.java && java nftest

Then you will get this output:

     élève is 1.
     élève is 2.
     élève is 3.
     élève is 4.

Those probably look the same.  Running them through `uniquote -x` shows
though that they are not:

     \x{E9}l\x{E8}ve is 1.
     e\x{301}le\x{300}ve is 2.
     \x{E9}le\x{300}ve is 3.
     e\x{301}l\x{E8}ve is 4.

See the difference?  Those are variable names, and I do not think Java
should permit duplicate variable names that differ only in normalization,
since it obviously cannot be permitted to do so for classnames, and it
feels hackish to have different identifier rules for classnames than for
other identifiers.
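
For readers without the attachment, here is a sketch of the same situation. It is not the attached nftest.java; the class name EleveSketch and its methods are made up for illustration. The identifiers are written with Unicode escapes so that no mailer or editor can renormalize them, and java.text.Normalizer is used to show that the four spellings from the uniquote output collapse to one under NFC.

```java
import java.text.Normalizer;
import java.util.HashSet;
import java.util.Set;

public class EleveSketch {
    // The four spellings from the uniquote output, as string data.
    static final String[] SPELLINGS = {
        "\u00e9l\u00e8ve",    // both accents precomposed
        "e\u0301le\u0300ve",  // both accents decomposed
        "\u00e9le\u0300ve",   // first precomposed, second decomposed
        "e\u0301l\u00e8ve",   // first decomposed, second precomposed
    };

    // Four local variables whose names differ only in normalization.
    // javac accepts them because it treats them as four different identifiers.
    public static int[] fourLocals() {
        int \u00e9l\u00e8ve = 1;
        int e\u0301le\u0300ve = 2;
        int \u00e9le\u0300ve = 3;
        int e\u0301l\u00e8ve = 4;
        return new int[] { \u00e9l\u00e8ve, e\u0301le\u0300ve, \u00e9le\u0300ve, e\u0301l\u00e8ve };
    }

    // Counts distinct names, optionally normalizing each one to NFC first.
    public static int countDistinct(String[] names, boolean normalizeFirst) {
        Set<String> seen = new HashSet<String>();
        for (String name : names) {
            seen.add(normalizeFirst ? Normalizer.normalize(name, Normalizer.Form.NFC) : name);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(fourLocals().length);             // prints 4
        System.out.println(countDistinct(SPELLINGS, false)); // prints 4
        System.out.println(countDistinct(SPELLINGS, true));  // prints 1
    }
}
```

A compiler that normalized identifiers before comparing them would have to reject fourLocals() for declaring the same variable more than once.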

Is this a bug?  If so, are there plans to address it?
And what about the filesystem?

I am unaware of any document in The Unicode Standard that references
either or both of these issues; if any such exist, kindly point me at
them.  My hunch is that these two problems, even though they are
completely consequential to Unicode, exist beyond the proper purview
of The Unicode Standard itself.  But that doesn't absolve us from
solving them.

Has this been previously discussed, and if so, what if any decision
was made regarding these two interrelated problems?

Thank you very much.

--tom

     PS: The MIME contents of this message are as follows:

  msg part  type/subtype              size description
    1       multipart/mixed           8904
      1     text/plain                4071 a letter from tchrist
      2     application/octet-stream  1560 the nftest(-v1).java program as octets
              name="nftest-v1.java"
              filename="nftest-v1.java"
      3     text/plain                1560 the nftest(-v2).java program as plain text

