[
https://issues.apache.org/jira/browse/AVRO-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205248#comment-13205248
]
Raymie Stata commented on AVRO-1022:
------------------------------------
I've pulled together some documentation on how different languages handle
non-ASCII characters in identifiers. You'll see that languages vary greatly in
which non-ASCII characters are allowed in identifiers, whether or not they
are normalized, and _how_ they are normalized when they are.
One of the goals of Avro is to support specifications that interoperate well
across languages. Given all the variability in how different languages handle
non-ASCII characters, I stand by what I said earlier: handling Unicode well in
Avro is a lot of work, and doing it poorly (as we do now) just creates nasty
interop problems.
---
The Unicode consortium has published a recommendation for defining Unicode
identifiers:
http://www.unicode.org/reports/tr31/
C# follows it almost exactly; Python follows it mostly; Java
kind of follows it, but not really; C/C++ ignore it; and, as far as I can tell,
neither Ruby nor PHP has given Unicode identifiers much thought at all.
Regarding Python, Python 2.x only allowed ASCII characters in identifiers. It
wasn't until Python 3.x that Unicode characters were allowed. Python 3.x
follows Unicode TR31. However, while Python calls for NFKC normalization,
it does not use the "modified" NFKC normalization recommended in TR31.
C# follows Unicode TR31 exactly (except that it allows identifiers to start
with an underscore). Thus, C#'s handling of non-ASCII identifiers is similar
to Python's, except that C# calls for NFC rather than NFKC. Also, C# requires
that its input arrives in normal form, and states that "The behavior when
encountering an identifier not in Normalization Form C is
implementation-defined; however, a diagnostic is not required" (presumably a
diagnostic would be allowed). Python, on the other hand, says that
"identifiers are converted into the normal form NFKC while parsing."
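To make the NFC/NFKC difference concrete, here is a minimal sketch using
Python 3's standard unicodedata module. The identifier chosen (a name that
starts with the "fi" ligature, U+FB01) and the variable names are purely
illustrative; they do not come from any of the specs quoted above.

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI is a compatibility character:
# NFKC folds it into the two letters "fi", while NFC leaves it alone.
ligature_name = "\ufb01le"

nfc_name = unicodedata.normalize("NFC", ligature_name)
nfkc_name = unicodedata.normalize("NFKC", ligature_name)

print(nfc_name == ligature_name)  # NFC keeps the ligature
print(nfkc_name == "file")        # NFKC yields plain ASCII "file"

# Python 3 applies NFKC while parsing, so source text spelled with the
# ligature ends up binding the plain ASCII name.
namespace = {}
exec("\ufb01le = 1", namespace)
print("file" in namespace)
```

Under C#'s NFC rule the two spellings would remain distinct identifiers, so
a name that round-trips through Python can silently change spelling on the
way to another language.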
Java makes no reference to TR31, but it does seem to have been inspired by it.
However, it's more restrictive than TR31 (and thus C# and Python). For
example, while Python (and TR31) allow non-spacing marks, Java does not. Also,
unlike TR31/C#/Python, the Java language does _not_ call for normalization, and
is rather explicit about this: "Unicode composite characters are different from
the decomposed characters. For example, a LATIN CAPITAL LETTER A ACUTE (Á,
\u00c1) could be considered to be the same as a LATIN CAPITAL LETTER A (A,
\u0041) immediately followed by a NON-SPACING ACUTE (´, \u0301) when sorting,
but these are different in identifiers."
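The example from the Java spec can be reproduced as a small sketch. It is
written in Python only for brevity; same_identifier is a hypothetical helper,
not part of Java, Avro, or any library. With no normalization form it
compares code points the way the Java quote describes; with a form it
compares the way a normalizing language would.

```python
import unicodedata

composed = "\u00c1"      # LATIN CAPITAL LETTER A WITH ACUTE
decomposed = "A\u0301"   # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT


def same_identifier(a, b, form=None):
    """Compare two names, optionally normalizing both to `form` first."""
    if form is not None:
        a = unicodedata.normalize(form, a)
        b = unicodedata.normalize(form, b)
    return a == b


# No normalization (the Java rule): the two spellings are different names.
print(same_identifier(composed, decomposed))

# NFC or NFKC normalization: the two spellings collapse into one name.
print(same_identifier(composed, decomposed, form="NFC"))
print(same_identifier(composed, decomposed, form="NFKC"))
```

So a schema with two fields that differ only in composition is fine for
Java but collides in any language that normalizes.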
C/C++ do not come close to TR31 and are still very restrictive. The
specification lists just a few sets of non-ASCII letters that can be in an
identifier (http://www.kuzbass.ru:8086/docs/isocpp/extendid.html#extendid).
These exclude many other Unicode letters that are allowed by C#, Python and
Java, and exclude other non-letter characters (such as connecting punctuation)
allowed in those languages. Also, while TR31/C#/Java/Python allow non-Arabic
digits in identifiers (e.g., Ethiopic digits), C/C++ do not.
PHP defines a letter as follows: "a letter is a-z, A-Z, and the bytes from 127
through 255 (0x7f-0xff)." It says nothing about Unicode, including anything
about normalization. Since much of the time input is presumably in UTF-8, the
0x7f-0xff range implicitly captures _everything_ in Unicode that isn't in the
Basic Latin block -- this goes way beyond what's allowed by the languages
discussed above. In short, they just haven't thought about the problem.
I can't find a language spec for Ruby or much discussion on Unicode variables
in that language. More generally, it looks like Ruby's support for Unicode was
bad prior to 1.9 (Jan 2009). Here's a discussion of how 1.9 makes it better:
http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html
But there isn't any discussion of variable names.
Here's some summary info on support for Unicode variable-names in many
different languages:
http://rosettacode.org/wiki/Unicode_variable_names
> Error in validate name
> ----------------------
>
> Key: AVRO-1022
> URL: https://issues.apache.org/jira/browse/AVRO-1022
> Project: Avro
> Issue Type: Bug
> Components: java
> Reporter: Raymie Stata
> Priority: Minor
> Attachments: AVRO-1022.patch
>
>
> Fix schema.validateName to allow only ASCII letters, not Unicode letters.