On 2018-09-21 18:27, Neia Neutuladh wrote:
D's currently accepted identifier characters are based on Unicode 2.0:
* ASCII range values are handled specially.
* Letters and combining marks from Unicode 2.0 are accepted.
* Numbers outside the ASCII range are accepted.
* Eight random punctuation marks are accepted.
This follows the C99 standard.
Many languages use the Unicode standard explicitly: C#, Go, Java,
Python, ECMAScript, just to name a few. A small number of languages
reject non-ASCII characters: Dart, Perl. Some languages are weirdly
generous: Swift and C11 allow everything outside the Basic Multilingual
Plane.
I'd like to update D's rule so that a character is valid in an
identifier if it's a letter, combining mark, or modifier symbol
that's present in Unicode 11, or a non-ASCII number. This allows
the 146 most popular writing systems and a lot more characters from
those writing systems. This *would* reject those eight random
punctuation marks, so I'll keep them in as legacy characters.
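As a rough illustration of the rule proposed above (this is not the code from the PR, and it is Python rather than D), the check can be sketched with Unicode general categories. Several things here are assumptions: the name `is_ident_char` is hypothetical, `LEGACY_CHARS` is left empty because the post does not list the eight punctuation marks, "non-ASCII number" is read as any `N*` category, and Python's `unicodedata` uses whatever Unicode version ships with the interpreter, not necessarily Unicode 11. The sketch also does not distinguish identifier-start from identifier-continue characters.

```python
import unicodedata

# Placeholder: the post does not name the eight legacy punctuation marks,
# so this set is intentionally left empty.
LEGACY_CHARS = frozenset()


def is_ident_char(ch, legacy=LEGACY_CHARS):
    """Return True if ch would be accepted inside an identifier (sketch)."""
    if ord(ch) < 0x80:
        # ASCII is handled specially: letters, digits and underscore only.
        return ch.isalnum() or ch == "_"
    if ch in legacy:
        # Legacy characters kept for backward compatibility.
        return True
    cat = unicodedata.category(ch)
    # L* = letters, M* = combining marks, N* = numbers,
    # Sk = modifier symbols.
    return cat[0] in "LMN" or cat == "Sk"
```

Under these assumptions a Linear B syllable such as U+10000 is accepted because its category is `Lo`, while a currency sign such as U+20AC (`Sc`) is rejected.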
It would mean we don't have to reference the C99 standard when
enumerating the allowed characters; we just have to refer to the Unicode
standard, which we already need to talk about in the lexical part of the
spec.
It might also make the lexer a tiny bit faster; it reduces the number of
valid-ident-char segments to search from 245 to 134. On the other hand,
it will change the ident char ranges from wchar to dchar, which means
the table takes up marginally more memory.
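To make the trade-off above concrete, here is a minimal sketch of a range-table lookup and the size arithmetic, again in Python for illustration. It assumes the table is a sorted list of inclusive (start, end) pairs searched by binary search, which the post does not spell out; `RANGES`, `in_table` and `table_bytes` are hypothetical names, and the four ranges are only a sample, not the real 134-segment table.

```python
from bisect import bisect_right

# Illustrative sample only: sorted, non-overlapping, inclusive ranges.
RANGES = [
    (0x00C0, 0x00D6),    # Latin-1 letters before the multiplication sign
    (0x00D8, 0x00F6),    # Latin-1 letters after it
    (0x0370, 0x0373),    # a few Greek letters
    (0x10000, 0x1000B),  # first Linear B syllables (outside the BMP)
]
STARTS = [lo for lo, _ in RANGES]


def in_table(cp):
    """Binary-search the range table for code point cp."""
    i = bisect_right(STARTS, cp) - 1  # last range starting at or before cp
    return i >= 0 and cp <= RANGES[i][1]


def table_bytes(segments, unit_bytes):
    """Size of a table storing one (start, end) pair per segment."""
    return segments * 2 * unit_bytes


old_size = table_bytes(245, 2)  # wchar is 2 bytes: 980 bytes
new_size = table_bytes(134, 4)  # dchar is 4 bytes: 1072 bytes
```

If the table really is stored as pairs, that is 980 bytes before and 1072 bytes after, which matches "marginally more memory". The last sample range also shows why the element type has to widen: code points above U+FFFF do not fit in a 16-bit wchar.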
And, of course, it lets you write programs entirely in Linear B, and
that's a marketing ploy not to be missed.
I've got this coded up and can submit a PR, but I thought I'd get
feedback here first.
Does anyone see any horrible potential problems here?
Or is there an interestingly better option?
Does this need a DIP?
I'm not a native English speaker, but I write all my public and private
code in English. I expect anyone I work with to write their code in
English as well, and I'll make sure they do. Plain English isn't enough
either; it has to be American English.
Despite this, I think that D should support as much of Unicode as
possible (including using Unicode for identifiers). It should not be up
to the programming language to decide which language the developer
should write the code in.
--
/Jacob Carlborg