Re: Is(n't) this a Java Unicode compiler bug? [4=OSCON]

Xueming Shen Tue, 19 Jul 2011 10:40:06 -0700

 Tom,

JLS 3.8 [1], Identifiers, states:

"Two identifiers are the same only if they are identical, that is, have the same Unicode character for each letter or digit.

Identifiers that have the same external appearance may yet be different. For example, the identifiers consisting of the single letters LATIN CAPITAL LETTER A (|A|, |\u0041|), LATIN SMALL LETTER A (|a|, |\u0061|), GREEK CAPITAL LETTER ALPHA (|A|, |\u0391|), CYRILLIC SMALL LETTER A (|a|, |\u0430|) and MATHEMATICAL BOLD ITALIC SMALL A (|a|, |\ud835\udc82|) are all different.

Unicode composite characters are different from the decomposed characters. For example, a LATIN CAPITAL LETTER A ACUTE (|Á|, |\u00c1|) could be considered to be the same as a LATIN CAPITAL LETTER A (|A|, |\u0041|) immediately followed by a NON-SPACING ACUTE (´, |\u0301|) when sorting, but these are different in identifiers."

We happened to have a short discussion regarding this section a couple of days ago (Alex is working on the latest JLS, and we were discussing the possibility of re-wording this section a little)... so at least for now those are NOT duplicate identifiers according to the Java Language Specification. It might be an implementation issue though, if the file system works in a different way, as you suggested.
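
To make the distinction in the quoted section concrete, here is a minimal sketch (not part of the JLS; the class and method names are made up for illustration) that compares the composed and decomposed spellings of A-acute from the example, first by exact character identity and then after canonical normalization with java.text.Normalizer:

```java
import java.text.Normalizer;

public class IdentifierIdentity {
    // True only when both strings hold exactly the same sequence of characters,
    // which is the kind of comparison JLS 3.8 describes for identifiers.
    static boolean identical(String a, String b) {
        return a.equals(b);
    }

    // True when both strings have the same NFC form, i.e. they are
    // canonically equivalent in the Unicode sense.
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00c1";    // LATIN CAPITAL LETTER A WITH ACUTE
        String decomposed = "A\u0301"; // A followed by COMBINING ACUTE ACCENT
        System.out.println(identical(composed, decomposed));             // false
        System.out.println(canonicallyEquivalent(composed, decomposed)); // true
    }
}
```

The first comparison is the one the specification describes; the second is only what a normalization-aware rule might look like.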


-Sherman

[1] http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.8


On 7/19/2011 7:52 AM, Tom Christiansen wrote:

In preparation for a short talk I'm to give in Portland next week about how
different operating systems, filesystems, and languages (including but not
limited to regexes) handle Unicode, I got to thinking about normalization
issues. And I think I've found a Java compiler bug, or at best, an
infelicity in a grey area.

It's no consolation, but Perl has exactly the same problem (well, pair of
problems) as Java has here. We do the same thing as Java, which I think is
the Wrong Thing, and we are also at the mercy of our filesystem for mapping
of classnames to filesystem objects, which is even worse.

I would like someone to tell me why Java shouldn't be fixed to cope with
these matters, both as internal identifiers and as those that exist outside
Java proper, in the filesystem (classnames).

After reading about the differences between how Apple and
Sun did normalization in the filesystem:

http://developers.sun.com/global/products_platforms/solaris/reference/presentations/IUC29-FileSystems.pdf

I wondered what impact this might/must have on Java. After all,
classnames must map to filesystem entries, and therefore if the system
is doing any kind of normalization, you're going to have Issues. Apple
runs everything through NFD (well, nearly), whereas the Sun paper cited
from about five years ago says that they plan to do something analogous
to how "case-preserving but case-insensitive" filesystems behave: that is,
they'll let you create anything you want, but won't let you create a new
entry in the same directory if they are canonically equivalent.
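
As a rough illustration of that "normalization-preserving but normalization-insensitive" idea, the sketch below groups a list of file names by their NFD form and reports the groups that would collide. It is only an assumption about how such a check could be written, not how either filesystem implements it; the class name, method name and sample file names are hypothetical.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NameCollisions {
    // Groups names by their NFD form and keeps only the groups holding more
    // than one spelling, i.e. names that are canonically equivalent.
    static Map<String, List<String>> collisions(List<String> names) {
        Map<String, List<String>> byNfd = new LinkedHashMap<>();
        for (String name : names) {
            String key = Normalizer.normalize(name, Normalizer.Form.NFD);
            byNfd.computeIfAbsent(key, k -> new ArrayList<>()).add(name);
        }
        byNfd.values().removeIf(group -> group.size() < 2);
        return byNfd;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "\u00e9l\u00e8ve.class",   // precomposed accents
                "e\u0301le\u0300ve.class", // decomposed accents
                "other.class");
        // One colliding group containing the two accented spellings.
        System.out.println(collisions(names).size());
    }
}
```

A filesystem that stores everything in NFD would silently merge the two accented names, while one following the Sun proposal would presumably refuse to create the second.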

Before I went so far as to test this on Apple and Sun machines, let
alone others, I thought I would just try my test on local variables
instead. I have now tested it on Sun, Apple, and Linux, including
various versions of the compiler, and they all report the same thing.
And the thing they report, I feel, is wrong, because I know that it
will not work this way for class names the way it will for local variables.

I will include the source code twice, once as plain text so you can
read it, and once as an octet stream lest a "helpful" mailer decides
it should be normalizing things that pass through it, an evil that the
Apple mouse will do to you, believe it or not.

If you put this wicked file in a file called "nftest.java" and run
this command:

$ javac -encoding UTF-8 nftest.java && java nftest

Then you will get this output:

élève is 1.
élève is 2.
élève is 3.
élève is 4.

Those probably look the same. Running them through `uniquote -x` shows
though that they are not:

\x{E9}l\x{E8}ve is 1.
e\x{301}le\x{300}ve is 2.
\x{E9}le\x{300}ve is 3.
e\x{301}l\x{E8}ve is 4.
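
The attached nftest.java is not reproduced in this archive. The following is only a sketch of the same idea, written with Unicode escapes so that no mailer or editor can normalize it in transit; the class and method names are made up for illustration. It declares four local variables whose names are the four spellings above and counts how many distinct strings those spellings are before and after NFC normalization.

```java
import java.text.Normalizer;
import java.util.HashSet;
import java.util.Set;

public class FourSpellings {
    static final String[] SPELLINGS = {
        "\u00e9l\u00e8ve",   // both accents precomposed
        "e\u0301le\u0300ve", // both accents decomposed
        "\u00e9le\u0300ve",  // first precomposed, second decomposed
        "e\u0301l\u00e8ve",  // first decomposed, second precomposed
    };

    // Number of distinct spellings when compared character by character.
    static int distinctRaw() {
        Set<String> seen = new HashSet<>();
        for (String s : SPELLINGS) {
            seen.add(s);
        }
        return seen.size();
    }

    // Number of distinct spellings once each is normalized to NFC.
    static int distinctAfterNfc() {
        Set<String> seen = new HashSet<>();
        for (String s : SPELLINGS) {
            seen.add(Normalizer.normalize(s, Normalizer.Form.NFC));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Four separate local variables, one per spelling, in the same scope.
        int \u00e9l\u00e8ve = 1;
        int e\u0301le\u0300ve = 2;
        int \u00e9le\u0300ve = 3;
        int e\u0301l\u00e8ve = 4;
        System.out.println(\u00e9l\u00e8ve + e\u0301le\u0300ve + \u00e9le\u0300ve + e\u0301l\u00e8ve); // 10
        System.out.println(distinctRaw());      // 4
        System.out.println(distinctAfterNfc()); // 1
    }
}
```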

See the difference? Those are variable names, and I do not think Java
should permit duplicate variable names that differ only in normalization,
since it obviously cannot be permitted to do so for classnames, and it
feels hackish to have different identifier rules for classnames as for
other variables.

Is this a bug? If so, are there plans to address it?
And what about the filesystem?

I am unaware of any document in The Unicode Standard that references
either or both of these issues; if any such exist, kindly point me at
them. My hunch is that these two problems, even though they are
completely consequential to Unicode, exist beyond the proper purview
of The Unicode Standard itself. But that doesn't absolve us from
solving them.

Has this been previously discussed, and if so, what if any decision
was made regarding these two interrelated problems?

Thank you very much.

--tom

PS: The MIME contents of this message are as follows:

msg  part  type/subtype              size  description
  1        multipart/mixed           8904
       1   text/plain                4071  a letter from tchrist
       2   application/octet-stream  1560  the nftest(-v1).java program as octets
                                           name="nftest-v1.java"
                                           filename="nftest-v1.java"
       3   text/plain                1560  the nftest(-v2).java program as plain text
