[jira] [Commented] (AVRO-1022) Error in validate name

Raymie Stata (Commented) (JIRA) Thu, 09 Feb 2012 21:36:53 -0800

    [ 
https://issues.apache.org/jira/browse/AVRO-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205248#comment-13205248
 ]


Raymie Stata commented on AVRO-1022:
------------------------------------

I've pulled together some documentation on how different languages handle 
non-ASCII characters in identifiers.  You'll see that languages vary greatly in 
which non-ASCII characters are allowed in identifiers, whether or not they 
are normalized, and _how_ they are normalized when they are.

One of the goals of Avro is to support specifications that interoperate well 
across languages.  Given all the variability in how different languages handle 
non-ASCII characters, I stand by what I said earlier: handling Unicode well in 
Avro is a lot of work, and doing it poorly (as we do now) just creates nasty 
interop problems.
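
To make the interop concern concrete, here is a rough sketch in Python (not Avro's actual Java code).  It assumes the current check accepts any Unicode letter, which is what the issue description implies, and compares it with an ASCII-only rule.  The function names are made up for illustration.

```python
import re

# Assumption: the ASCII-only rule is a letter or underscore followed by
# letters, digits, or underscores.
ASCII_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_ascii_name(name: str) -> bool:
    """Hypothetical strict check: ASCII letters, digits, underscore only."""
    return ASCII_NAME.fullmatch(name) is not None


def is_unicode_letter_name(name: str) -> bool:
    """Hypothetical permissive check: any Unicode letter or digit."""
    if not name:
        return False
    first = name[0]
    if not (first.isalpha() or first == "_"):
        return False
    return all(c.isalnum() or c == "_" for c in name[1:])


composed = "caf\u00e9"      # e-acute as one code point
decomposed = "cafe\u0301"   # plain e followed by a combining acute

print(is_unicode_letter_name(composed))    # True
print(is_unicode_letter_name(decomposed))  # False: the combining mark is not a letter
print(is_ascii_name(composed), is_ascii_name(decomposed))  # False False
```

The two spellings render identically, yet the permissive check accepts one and rejects the other, so the outcome depends on how the schema author's editor happened to encode the name.  The ASCII-only rule at least treats them the same way.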

---

The Unicode consortium has published a recommendation for defining Unicode 
identifiers:

http://www.unicode.org/reports/tr31/

C# follows it almost exactly; Python follows it mostly; Java 
kind of follows it, but not really; C/C++ ignore it; and, as far as I can tell, 
neither Ruby nor PHP has given Unicode identifiers much thought at all.

Regarding Python, Python 2.x only allowed ASCII characters in identifiers.  It 
wasn't until Python 3.x that Unicode characters were allowed.  Python 3.x 
follows Unicode TR31.  However, while Python calls for NFKC normalization, 
it does not use the "modified" NFKC normalization recommended in TR31.
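
As a small illustration of the NFKC behavior in Python 3, the sketch below executes an assignment whose identifier is spelled with the ligature U+FB01 and then looks at the name that actually gets bound.  The variable names are mine, chosen for the example.

```python
import unicodedata

# Identifier spelled with U+FB01 (LATIN SMALL LIGATURE FI) followed by "le".
source = "\ufb01le = 1"

namespace = {}
exec(source, namespace)

# NFKC folds the ligature into the two ASCII letters "fi".
normalized = unicodedata.normalize("NFKC", "\ufb01le")

print(normalized)              # file
print("file" in namespace)     # True: the parser bound the normalized name
print("\ufb01le" in namespace) # False: the original spelling is gone
```

So two visually different spellings end up as the same Python identifier, which is exactly the kind of behavior another language's tooling may not reproduce.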

C# follows Unicode TR31 exactly (except that it allows identifiers to start 
with an underscore).  Thus, C#'s handling of non-ASCII identifiers is similar 
to Python's, except that C# calls for NFC rather than NFKC.  Also, C# requires 
that its input arrives in normal form, and states that "The behavior when 
encountering an identifier not in Normalization Form C is 
implementation-defined; however, a diagnostic is not required" (presumably a 
diagnostic would be allowed).  Python, on the other hand, says that 
"identifiers are converted into the normal form NFKC while parsing."

Java makes no reference to TR31, but it does seem to have been inspired by it.  
However, it's more restrictive than TR31 (and thus C# and Python).  For 
example, while Python (and TR31) allow non-spacing marks, Java does not.  Also, 
unlike TR31/C#/Python, the Java language does _not_ call for normalization, and 
is rather explicit about this: "Unicode composite characters are different from 
the decomposed characters. For example, a LATIN CAPITAL LETTER A ACUTE (Á, 
\u00c1) could be considered to be the same as a LATIN CAPITAL LETTER A (A, 
\u0041) immediately followed by a NON-SPACING ACUTE (´, \u0301) when sorting, 
but these are different in identifiers."
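
The composed/decomposed distinction in that quote can be seen directly at the string level.  This is a sketch in Python rather than Java, used only to show the code points involved; without a normalization step the two spellings are simply different strings.

```python
import unicodedata

composed = "\u00c1"     # LATIN CAPITAL LETTER A WITH ACUTE
decomposed = "A\u0301"  # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

# Compared code point by code point, the two spellings differ.
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 1 2

# After normalization they compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True

# A symbol table keyed on raw strings keeps them as two separate names.
symbols = {composed: "first", decomposed: "second"}
print(len(symbols))                      # 2
```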

C/C++ do not come close to TR31 and are still very restrictive.  The 
specification lists just a few sets of non-ASCII letters that can be in an 
identifier (http://www.kuzbass.ru:8086/docs/isocpp/extendid.html#extendid).  
These exclude many other Unicode letters that are allowed by C#, Python and 
Java, and exclude other non-letter characters (such as connector punctuation) 
allowed in those languages.  Also, while TR31/C#/Java/Python allow non-Arabic 
digits in identifiers (e.g., Ethiopic digits), C/C++ do not.

PHP defines a letter as follows: "a letter is a-z, A-Z, and the bytes from 127 
through 255 (0x7f-0xff)."  It says nothing about Unicode, including anything 
about normalization.  Since much of the time input is presumably in UTF-8, the 
0x7f-0xff range implicitly captures _everything_ in Unicode that isn't in the 
Basic Latin block -- this goes way beyond what's allowed by the languages 
discussed above.  In short, they just haven't thought about the problem.

I can't find a language spec for Ruby or much discussion on Unicode variables 
in that language.  More generally, it looks like Ruby's support for Unicode was 
bad prior to 1.9 (Jan 2009).  Here's a discussion of how 1.9 makes it better: 
http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html 
But there isn't any discussion of variable names.

Here's some summary info on support for Unicode variable names in many 
different languages:

http://rosettacode.org/wiki/Unicode_variable_names

                
> Error in validate name
> ----------------------
>
>                 Key: AVRO-1022
>                 URL: https://issues.apache.org/jira/browse/AVRO-1022
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>            Reporter: Raymie Stata
>            Priority: Minor
>         Attachments: AVRO-1022.patch
>
>
> Fix schema.validateName to allow only ASCII letters, not Unicode letters.

