Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-11-22 Thread Henri Sivonen via Unicode
On Wed, Jun 13, 2018 at 2:49 PM Mark Davis ☕️  wrote:
>
> > That is, why is conforming to UAX #31 worth the risk of prohibiting the use 
> > of characters that some users might want to use?
>
> One could parse for certain sequences, putting characters into a number of 
> broad categories. Very approximately:
>
> junk ~= [[:cn:][:cs:][:co:]]+
> whitespace ~= [[:z:][:c:]-junk]+
> syntax ~= [[:s:][:p:]] // broadly speaking, including both the language 
> syntax & user-named operators
> identifiers ~= [all-else]+
>
> UAX #31 specifies several different kinds of identifiers, and takes roughly 
> that approach for 
> http://unicode.org/reports/tr31/#Immutable_Identifier_Syntax, although the 
> focus there is on immutability.
>
> So an implementation could choose to follow that course, rather than the more 
> narrowly defined identifiers in 
> http://unicode.org/reports/tr31/#Default_Identifier_Syntax. Alternatively, 
> one can conform to the Default Identifiers but declare a profile that expands 
> the allowable characters. One could take a Swiftian approach, for example...

Thank you and sorry about my slow reply. Why is excluding junk important?

> On Fri, Jun 8, 2018 at 11:07 AM, Henri Sivonen via Unicode  wrote:
>>
>> On Wed, Jun 6, 2018 at 2:55 PM, Henri Sivonen  wrote:
>> > Considering that ruling out too much can be a problem later, but just
>> > treating anything above ASCII as opaque hasn't caused trouble (that I
>> > know of) for HTML other than compatibility issues with XML's stricter
>> > stance, why should a programming language, if it opts to support
>> > non-ASCII identifiers in an otherwise ASCII core syntax, implement the
>> > complexity of UAX #31 instead of allowing everything above ASCII in
>> > identifiers? In other words, what problem does making a programming
>> > language conform to UAX #31 solve?
>>
>> After refreshing my memory of XML history, I realize that mentioning
>> XML does not helpfully illustrate my question despite the mention of
>> XML 1.0 5th ed. in UAX #31 itself. My apologies for that. Please
>> ignore the XML part.
>>
>> Trying to rephrase my question more clearly:
>>
>> Let's assume that we are designing a computer-parseable syntax where
>> tokens consisting of user-chosen characters can't occur next to each
>> other and, instead, always have some syntax-reserved characters
>> between them. That is, I'm talking about syntaxes that look like this
>> (could be e.g. Java):
>>
>> ab.cd();
>>
>> Here, ab and cd are tokens with user-chosen characters whereas space
>> (the indent), period, parenthesis and the semicolon are
>> syntax-reserved. We know that ab and cd are distinct tokens, because
>> there is a period between them, and we know the opening parenthesis
>> ends the cd token.
>>
>> To illustrate what I'm explicitly _not_ talking about, I'm not talking
>> about a syntax like this:
>>
>> αβ⊗γδ
>>
>> Here αβ and γδ are user-named variable names and ⊗ is a user-named
>> operator and the distinction between different kinds of user-named
>> tokens has to be known somehow in order to be able to tell that there
>> are three distinct tokens: αβ, ⊗, and γδ.
>>
>> My question is:
>>
>> When designing a syntax where tokens with the user-chosen characters
>> can't occur next to each other without some syntax-reserved characters
>> between them, what advantages are there from limiting the user-chosen
>> characters according to UAX #31 as opposed to treating any character
>> that is not a syntax-reserved character as a character that can occur
>> in user-named tokens?
>>
>> I understand that taking the latter approach allows users to mint
>> tokens that on some aesthetic measure don't make sense (e.g. minting
>> tokens that consist of glyphless code points), but why is it important
>> to prescribe that this is prohibited as opposed to just letting users
>> choose not to mint tokens that are inconvenient for them to work with
>> given the behavior that their plain text editor gives to various
>> characters? That is, why is conforming to UAX #31 worth the risk of
>> prohibiting the use of characters that some users might want to use?
>> The introduction of XID after ID and the introduction of Extended
>> Hashtag Identifiers after XID is indicative of over-restriction having
>> been a problem.
>>
>> Limiting user-minted tokens to UAX #31 does not appear to be necessary
>> for security purposes considering that HTML and CSS exist in a
>> particularly adversarial environment and get away with taking the
>> approach that any character that isn't a syntax-reserved character is
>> collected as part of a user-minted identifier. (Informally, both treat
>> non-ASCII characters the same as an ASCII underscore. HTML even treats
>> non-whitespace, non-U+ ASCII controls that way.)
>>
>> --
>> Henri Sivonen
>> hsivo...@hsivonen.fi
>> https://hsivonen.fi/
>>
>


-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/



Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-13 Thread Mark Davis ☕️ via Unicode
> That is, why is conforming to UAX #31 worth the risk of prohibiting the
use of characters that some users might want to use?

One could parse for certain sequences, putting characters into a number of
broad categories. Very approximately:

   - junk ~= [[:cn:][:cs:][:co:]]+
   - whitespace ~= [[:z:][:c:]-junk]+
   - syntax ~= [[:s:][:p:]] // broadly speaking, including both the
   language syntax & user-named operators
   - identifiers ~= [all-else]+

UAX #31 specifies several different kinds of identifiers, and takes roughly
that approach for
http://unicode.org/reports/tr31/#Immutable_Identifier_Syntax, although the
focus there is on immutability.
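
A rough sketch of that broad-category split (a Python sketch using unicodedata general categories to stand in for the UnicodeSet expressions above; the category sets are approximations, not the UAX #31 definitions):

    import unicodedata

    def broad_category(ch):
        # Very approximate classification, mirroring the sets sketched above.
        cat = unicodedata.category(ch)
        if cat in ('Cn', 'Cs', 'Co'):                   # unassigned, surrogates, private use
            return 'junk'
        if cat.startswith('Z') or cat.startswith('C'):  # separators, controls, format chars
            return 'whitespace'
        if cat.startswith('S') or cat.startswith('P'):  # symbols and punctuation
            return 'syntax'
        return 'identifier'                             # all else

    for ch in 'ab.cd(); αβ⊗γδ':
        print(repr(ch), broad_category(ch))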

So an implementation could choose to follow that course, rather than the
more narrowly defined identifiers in
http://unicode.org/reports/tr31/#Default_Identifier_Syntax. Alternatively,
one can conform to the Default Identifiers but declare a profile that
expands the allowable characters. One could take a Swiftian approach, for
example...

Mark

On Fri, Jun 8, 2018 at 11:07 AM, Henri Sivonen via Unicode <
unicode@unicode.org> wrote:

> On Wed, Jun 6, 2018 at 2:55 PM, Henri Sivonen 
> wrote:
> > Considering that ruling out too much can be a problem later, but just
> > treating anything above ASCII as opaque hasn't caused trouble (that I
> > know of) for HTML other than compatibility issues with XML's stricter
> > stance, why should a programming language, if it opts to support
> > non-ASCII identifiers in an otherwise ASCII core syntax, implement the
> > complexity of UAX #31 instead of allowing everything above ASCII in
> > identifiers? In other words, what problem does making a programming
> > language conform to UAX #31 solve?
>
> After refreshing my memory of XML history, I realize that mentioning
> XML does not helpfully illustrate my question despite the mention of
> XML 1.0 5th ed. in UAX #31 itself. My apologies for that. Please
> ignore the XML part.
>
> Trying to rephrase my question more clearly:
>
> Let's assume that we are designing a computer-parseable syntax where
> tokens consisting of user-chosen characters can't occur next to each
> other and, instead, always have some syntax-reserved characters
> between them. That is, I'm talking about syntaxes that look like this
> (could be e.g. Java):
>
> ab.cd();
>
> Here, ab and cd are tokens with user-chosen characters whereas space
> (the indent), period, parenthesis and the semicolon are
> syntax-reserved. We know that ab and cd are distinct tokens, because
> there is a period between them, and we know the opening parenthesis
> ends the cd token.
>
> To illustrate what I'm explicitly _not_ talking about, I'm not talking
> about a syntax like this:
>
> αβ⊗γδ
>
> Here αβ and γδ are user-named variable names and ⊗ is a user-named
> operator and the distinction between different kinds of user-named
> tokens has to be known somehow in order to be able to tell that there
> are three distinct tokens: αβ, ⊗, and γδ.
>
> My question is:
>
> When designing a syntax where tokens with the user-chosen characters
> can't occur next to each other without some syntax-reserved characters
> between them, what advantages are there from limiting the user-chosen
> characters according to UAX #31 as opposed to treating any character
> that is not a syntax-reserved character as a character that can occur
> in user-named tokens?
>
> I understand that taking the latter approach allows users to mint
> tokens that on some aesthetic measure don't make sense (e.g. minting
> tokens that consist of glyphless code points), but why is it important
> to prescribe that this is prohibited as opposed to just letting users
> choose not to mint tokens that are inconvenient for them to work with
> given the behavior that their plain text editor gives to various
> characters? That is, why is conforming to UAX #31 worth the risk of
> prohibiting the use of characters that some users might want to use?
> The introduction of XID after ID and the introduction of Extended
> Hashtag Identifiers after XID is indicative of over-restriction having
> been a problem.
>
> Limiting user-minted tokens to UAX #31 does not appear to be necessary
> for security purposes considering that HTML and CSS exist in a
> particularly adversarial environment and get away with taking the
> approach that any character that isn't a syntax-reserved character is
> collected as part of a user-minted identifier. (Informally, both treat
> non-ASCII characters the same as an ASCII underscore. HTML even treats
> non-whitespace, non-U+ ASCII controls that way.)
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/
>
>


Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-08 Thread Hans Åberg via Unicode


> On 8 Jun 2018, at 11:07, Henri Sivonen via Unicode  
> wrote:
> 
> My question is:
> 
> When designing a syntax where tokens with the user-chosen characters
> can't occur next to each other without some syntax-reserved characters
> between them, what advantages are there from limiting the user-chosen
> characters according to UAX #31 as opposed to treating any character
> that is not a syntax-reserved character as a character that can occur
> in user-named tokens?

It seems best to stick to the canonical forms and add the sequences one deems 
useful and safe, as treating inequivalent characters as equal is likely to be 
confusing. But this requires more work; it seems that the use of the 
compatibility forms is aimed at something simple to implement.





Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-08 Thread Henri Sivonen via Unicode
On Wed, Jun 6, 2018 at 2:55 PM, Henri Sivonen  wrote:
> Considering that ruling out too much can be a problem later, but just
> treating anything above ASCII as opaque hasn't caused trouble (that I
> know of) for HTML other than compatibility issues with XML's stricter
> stance, why should a programming language, if it opts to support
> non-ASCII identifiers in an otherwise ASCII core syntax, implement the
> complexity of UAX #31 instead of allowing everything above ASCII in
> identifiers? In other words, what problem does making a programming
> language conform to UAX #31 solve?

After refreshing my memory of XML history, I realize that mentioning
XML does not helpfully illustrate my question despite the mention of
XML 1.0 5th ed. in UAX #31 itself. My apologies for that. Please
ignore the XML part.

Trying to rephrase my question more clearly:

Let's assume that we are designing a computer-parseable syntax where
tokens consisting of user-chosen characters can't occur next to each
other and, instead, always have some syntax-reserved characters
between them. That is, I'm talking about syntaxes that look like this
(could be e.g. Java):

ab.cd();

Here, ab and cd are tokens with user-chosen characters whereas space
(the indent), period, parenthesis and the semicolon are
syntax-reserved. We know that ab and cd are distinct tokens, because
there is a period between them, and we know the opening parenthesis
ends the cd token.

To illustrate what I'm explicitly _not_ talking about, I'm not talking
about a syntax like this:

αβ⊗γδ

Here αβ and γδ are user-named variable names and ⊗ is a user-named
operator and the distinction between different kinds of user-named
tokens has to be known somehow in order to be able to tell that there
are three distinct tokens: αβ, ⊗, and γδ.

My question is:

When designing a syntax where tokens with the user-chosen characters
can't occur next to each other without some syntax-reserved characters
between them, what advantages are there from limiting the user-chosen
characters according to UAX #31 as opposed to treating any character
that is not a syntax-reserved character as a character that can occur
in user-named tokens?
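
To make the contrast concrete, here is a minimal sketch of the two approaches
(Python; the reserved-character set is hypothetical, and str.isidentifier is
only a stand-in for a language's UAX #31 XID_Start XID_Continue* profile):

    RESERVED = set(" \t\n.();")   # hypothetical syntax-reserved characters

    def lex_permissive(src):
        """Collect any run of non-reserved characters as one user-named token."""
        tokens, current = [], ""
        for ch in src:
            if ch in RESERVED:
                if current:
                    tokens.append(current)
                current = ""
            else:
                current += ch
        if current:
            tokens.append(current)
        return tokens

    def lex_uax31ish(src):
        """Additionally reject tokens that are not XID_Start XID_Continue*."""
        return [t for t in lex_permissive(src) if t.isidentifier()]

    print(lex_permissive("ab.cd();"))        # ['ab', 'cd']
    print(lex_permissive("a\u2029b.cd();"))  # ['a\u2029b', 'cd'] -- paragraph separator collected
    print(lex_uax31ish("a\u2029b.cd();"))    # ['cd'] -- the first token is rejected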

I understand that taking the latter approach allows users to mint
tokens that on some aesthetic measure don't make sense (e.g. minting
tokens that consist of glyphless code points), but why is it important
to prescribe that this is prohibited as opposed to just letting users
choose not to mint tokens that are inconvenient for them to work with
given the behavior that their plain text editor gives to various
characters? That is, why is conforming to UAX #31 worth the risk of
prohibiting the use of characters that some users might want to use?
The introduction of XID after ID and the introduction of Extended
Hashtag Identifiers after XID is indicative of over-restriction having
been a problem.

Limiting user-minted tokens to UAX #31 does not appear to be necessary
for security purposes considering that HTML and CSS exist in a
particularly adversarial environment and get away with taking the
approach that any character that isn't a syntax-reserved character is
collected as part of a user-minted identifier. (Informally, both treat
non-ASCII characters the same as an ASCII underscore. HTML even treats
non-whitespace, non-U+ ASCII controls that way.)

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/



Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Frédéric Grosshans via Unicode

  
  
On 07/06/2018 at 18:01, Alastair Houghton wrote:

> I appreciate that the upshot of the Anglicised world of software
> engineering is that native English speakers have an advantage, and those
> for whom Latin isn’t their usual script are at a particular disadvantage,
> and I’m sure that seems unfair to many of us — but that doesn’t mean that
> allowing the use of other scripts everywhere, desirable as it is, is
> entirely unproblematic.

It depends on what one means by “allowing”, and it can clearly be
problematic to use non-ASCII characters. Restriction to (a subset of) ASCII
is indeed often the most reasonable choice, but when one writes a
specification for something which can be used in many contexts (like URLs,
or a programming language), not allowing it means forbidding it, even in
contexts where it makes sense.

[...]

>> If I understand you correctly, an Arabic speaker should always
>> transliterate the function name to ASCII,
>
> That’s one option; or they could write it in Arabic, but they need to be
> aware of the consequences of doing so (and those they are working for or
> with also need to understand that) [...];

We agree on this: they should be aware of the consequences. I think these
consequences should be essentially societal (as in the example you give),
but not technical, since the former are supposed to be well understood by
everyone.

[...]

>>> UAX #31 also manages (I suspect unintentionally?) to give a good
>>> example of a pair of Farsi identifiers that might be awkward to tell
>>> apart in certain fonts, namely نامهای and نامه‌ای; I think those are OK
>>> in monospaced fonts, where the join is reasonably wide, but at small
>>> point sizes in proportional fonts the difference in appearance is very
>>> subtle, particularly for a non-Arabic speaker.
>>
>> In ASCII, identifiers with I, l, and 1 can be difficult to tell apart.
>> And it is not an artificial problem: I’ve once had some difficulties
>> with an automatically generated login which was do11y but tried to type
>> dolly, despite my familiarity with ASCII. So I guess this problem is not
>> specific to the ASCII vs non-ASCII debate.
>
> It isn’t, though fonts used by programmers typically emphasise the
> differences between I, l and 1 as well as 0 and O, 5 and S and so on
> specifically to avoid this problem.

In your example, you specifically mentioned that it “might be awkward in
certain fonts” but “OK in monospaced fonts”, so nothing ASCII-specific
here.

> But please don’t misunderstand; I am not — and have not been — arguing
> against non-ASCII identifiers. We were asked whether there were any
> problems. These are problems (or perhaps we might call them
> “trade-offs”). We can debate the severity of them, and whether, and what,
> it’s worthwhile doing anything to mitigate any of them. What we shouldn’t
> do is sweep them under the carpet.

I totally agree. (And I misunderstood you in the first place, probably
because “non-ASCII is bad, whatever the context” is a common attitude
among programmers, even non-Latin-native ones.)

> Personally I think a combination of documentation to explain that it’s
> worth thinking carefully about which script(s) to use, and some steps to
> consider certain characters to be equivalent even though they aren’t the
> same (and shouldn’t be the same even when normalised) might be a good
> idea. Is that really so controversial a position?

Not at all. I misread “for reasonably wide values of ‘everyone’, at any
rate…” as saying “it is unreasonable to think of people not comfortable
with ASCII”, but it is clearly not what you intended to say.

We both agree that:

- using non-ASCII identifiers instantly limits the number of people who
  can work with them, so they should be used with care
- they however have some use cases, for users not familiar with ASCII

Frédéric

  



Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Asmus Freytag via Unicode

  
  
On 6/7/2018 9:01 AM, Alastair Houghton via Unicode wrote:

> But please don’t misunderstand; I am not — and have not been — arguing
> against non-ASCII identifiers. We were asked whether there were any
> problems. These are problems (or perhaps we might call them
> “trade-offs”). We can debate the severity of them, and whether, and what,
> it’s worthwhile doing anything to mitigate any of them. What we shouldn’t
> do is sweep them under the carpet.

Once you go beyond ASCII (or really any small well-known set of shapes) to
a very large universe like Unicode, you will lose something and gain
something.

You will gain being able to express some things more like they would be
written in ordinary text. You will lose by having identifiers that can be
more ambiguous, harder to recognize / replicate / display and so on.

Where identifiers are "private", say limited to the source code of a single
application, the user/author is in control and can avoid problematic cases.
However, even for source code, not all identifiers are truly private. Names
for classes and modules get turned into filenames, modules may be shared,
etc.

Code is also shared. If you use code that ostensibly calls on a public
library or module, but your identifier system allows spoofing, your use of
shared code may access malicious code hiding behind lookalike names.

Unicode has dozens of character pairs that look absolutely identical by
design (http://www.unicode.org/Public/security/10.0.0/intentional.txt gives
a subset of these), and many more combinations that could look identical in
any given font (but are not necessarily so in every font). Many of the
latter are combining sequences that are not normalized.

For many complex scripts, not all possible orderings of code points are
well-behaved. Some may not render on certain platforms / devices, while
others do. Sometimes, two alternative orderings will look the same.

Not paying attention to these issues will cause your identifier system to
be ill-behaved whenever it "leaks" into public identifier space,
particularly when your identifiers become file names or names of network
resources, because you want to allow sharing of libraries or modules.

---

The main point about allowing identifiers to look like words is to make
them mnemonic. Non-ASCII identifiers can be more mnemonic to those that use
other scripts. However, one does not need to allow the full Unicode range
unfiltered in order to achieve mnemonic labels. There are many things in
Unicode needed for very specialized texts, and while one can always imagine
some specialist delighting in writing a program where some object is
spelled precisely like it is in the "real" world, there really is no need
to allow such edge cases to undermine the stability and security of a
"reasonably" mnemonic system for the wider body of users.

To give an example: in Arabic, you can disallow *all* combining marks and
still get a strong identifier system. In fact, it will be stronger, because
many accidental similarities to letter shapes will be eliminated (and it's
not necessary to devise some complex folding).
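
A tiny sketch of that kind of restriction (Python; the helper name and the
sample labels are illustrative only, the policy of rejecting all combining
marks for Arabic-script labels is the one described above):

    import unicodedata

    def acceptable_arabic_label(label):
        # Reject any label containing a combining mark (categories Mn/Mc/Me).
        return not any(unicodedata.category(ch).startswith('M') for ch in label)

    print(acceptable_arabic_label('كتاب'))    # True: base letters only
    print(acceptable_arabic_label('كتابٌ'))   # False: carries a combining mark (tanwin)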
However, in Arabic again, there are pairs of digits that look identical
(and have the same numeric value). Allowing these into identifiers without
some folding would make it impossible for users to know (without looking at
the underlying bits) how to type an identifier containing one of them.

Further, in Arabic, several letter characters may be different in some
positions in a word, but identical in another position. Again, without some
folding there's no way you'll ever know which one.

A./

PS: for the past several years, I've been part of a project that seeks to
extend the types of domain names for top-level domains beyond ASCII. To get
an idea of what that entails, check out https://icann.org/idn and look for
"Root Zone Label Generation Rules", for example Arabic
(https://www.icann.org/sites/default/files/lgr/lgr-2-arabic-script-26jul17-en.html).
For a detailed discussion of the design, see
https://www.icann.org/en/system/files/files/arabic-lgr-proposal-18nov15-en.pdf.

These are for Root Zone identifiers, which exclude digits for example, so
you won't find discussion of digit-related issues. You also won't find
mention of any "foldings", but that is because the Root Zone uses a related
concept of "variant". For a programming
  

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Alastair Houghton via Unicode
On 7 Jun 2018, at 15:51, Frédéric Grosshans via Unicode  
wrote:
> 
>> IMO the major issue with non-ASCII identifiers is not a technical one, but 
>> rather that it runs the risk of fragmenting the developer community.  
>> Everyone can *type* ASCII and everyone can read Latin characters (for 
>> reasonably wide values of “everyone”, at any rate… most computer users 
>> aren’t going to have a problem). Not everyone can type Hangul, Chinese or 
>> Arabic (for instance), and there is no good fix or workaround for this.
> Well, your ”reasonable” value of everyone excludes many kids,

Every keyboard I’ve ever seen, including Chinese ones, is marked with ASCII 
characters as well. Typing ASCII on a machine in the Chinese locale might not 
be entirely straightforward, but entering Chinese characters, even on such a 
machine, takes significant training, and on a machine not set to Chinese locale 
it might even require the installation of additional software. It isn’t even 
the case, as I understand it, that all machines set to Chinese locales use the 
same input method, so being able to enter Chinese on one system doesn’t 
necessarily mean you’ll be able to do so on another. (I imagine it makes it 
easier to learn, once you’ve done it once, but still…)

I appreciate that the upshot of the Anglicised world of software engineering is 
that native English speakers have an advantage, and those for whom Latin isn’t 
their usual script are at a particular disadvantage, and I’m sure that seems 
unfair to many of us — but that doesn’t mean that allowing the use of other 
scripts everywhere, desirable as it is, is entirely unproblematic.

>> it isn’t obvious to a non-Arabic speaking user how to enter الطول in order 
>> to call it.
> OK. Clearly, someone not knowing the Arabic alphabet will have difficulties 
> with this one, but if one has good reason to think the targeted developer 
> community is literate in Arabic and has a lower mastery of the Latin alphabet, 
> it still may be a good idea.
> If I understand you correctly, an Arabic speaker should always transliterate 
> the function name to ASCII,

That’s one option; or they could write it in Arabic, but they need to be aware 
of the consequences of doing so (and those they are working for or with also 
need to understand that); or they could choose some other language, perhaps one 
shared with other teams who are likely to work on the code. Imagine you 
outsourced development to a team that happened to be Arabic speaking, and they 
developed (let’s say) French language software for you, but later you wanted to 
bring development in house and found all the identifiers were in Arabic script, 
which made the code very difficult for your developers to work with. That isn’t 
exactly going to make your day, and if it isn’t a problem that anyone has 
mentioned, it might not be obvious to you, when you originally outsourced your 
development, that you needed to make sure people weren't going to do that.

>>  UAX #31 also manages (I suspect unintentionally?) to give a good example of 
>> a pair of Farsi identifiers that might be awkward to tell apart in certain 
>> fonts, namely نامهای and نامه‌ای; I think those are OK in monospaced fonts, 
>> where the join is reasonably wide, but at small point sizes in proportional 
>> fonts the difference in appearance is very subtle, particularly for a 
>> non-Arabic speaker.
> In ASCII, identifiers with I, l, and 1 can be difficult to tell apart. And it 
> is not an artificial problem: I’ve once had some difficulties with an 
> automatically generated login which was do11y but tried to type dolly, 
> despite my familiarity with ASCII. So I guess this problem is not specific 
> to the ASCII vs non-ASCII debate

It isn’t, though fonts used by programmers typically emphasise the differences 
between I, l and 1 as well as 0 and O, 5 and S and so on specifically to avoid 
this problem.

But please don’t misunderstand; I am not — and have not been — arguing against 
non-ASCII identifiers. We were asked whether there were any problems. These are 
problems (or perhaps we might call them “trade-offs”). We can debate the 
severity of them, and whether, and what, it’s worthwhile doing anything to 
mitigate any of them. What we shouldn’t do is sweep them under the carpet.

Personally I think a combination of documentation to explain that it’s worth 
thinking carefully about which script(s) to use, and some steps to consider 
certain characters to be equivalent even though they aren’t the same (and 
shouldn’t be the same even when normalised) might be a good idea. Is that 
really so controversial a position?

Kind regards,

Alastair.

--
http://alastairs-place.net



Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Frédéric Grosshans via Unicode

On 06/06/2018 at 11:29, Alastair Houghton via Unicode wrote:

On 4 Jun 2018, at 20:49, Manish Goregaokar via Unicode  
wrote:

The Rust community is considering adding non-ascii identifiers, which follow 
UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also asks for 
identifiers to be treated as equivalent under NFKC.

Are there any cases where this will lead to inconsistencies? I.e. can the NFKC 
of a valid UAX 31 ident be invalid UAX 31?

(In general, are there other problems folks see with this proposal?)

IMO the major issue with non-ASCII identifiers is not a technical one, but 
rather that it runs the risk of fragmenting the developer community.  Everyone 
can *type* ASCII and everyone can read Latin characters (for reasonably wide 
values of “everyone”, at any rate… most computer users aren’t going to have a 
problem). Not everyone can type Hangul, Chinese or Arabic (for instance), and 
there is no good fix or workaround for this.
Well, your ”reasonable” value of everyone excludes many kids, and puts 
social barriers on the use of computers by non-native Latin writers. If 
the programme has no reason to be read and written by foreign 
programmers, why not use native-language and native-alphabet identifiers? Of 
course, as long as you write a function named الطول, you consciously 
restrict the developer community having access to this programme. But 
you also make your programme clearer to your Arabic-speaking 
community. If said community is e.g. school teachers (or students) in an 
Arabic-speaking country, it may be a good choice. I don’t see the 
difference from choosing to write a book in one language or another.

Note that this is orthogonal to issues such as which language identifiers [...] 
are written in [...];

It is indeed different, but not orthogonal


the problem is that e.g. given a function

   func الطول(s : String)

it isn’t obvious to a non-Arabic speaking user how to enter الطول in order to 
call it.
OK. Clearly, someone not knowing the Arabic alphabet will have 
difficulties with this one, but if one has good reason to think the 
targeted developer community is literate in Arabic and has a lower mastery 
of the Latin alphabet, it still may be a good idea.
If I understand you correctly, an Arabic speaker should always 
transliterate the function name to ASCII, and there are many different 
ways to do it (see e.g. 
https://en.wikipedia.org/wiki/Romanization_of_Arabic). Should they name 
their function altawil, altwl, alt.wl? And when calling it later, they 
should remember their ad-hoc ASCII Arabic orthography. I don’t doubt 
many, if not most, do it, but it can add an extra burden in programming. 
It’s a bit like remembering whether your name should be transliterated into 
Greek as Ηουγητον or Ουχτων, and using that for every identifier you come 
across. A mitigation strategy is to name your identifiers x1, x2, x3 and 
so on. The common knowledge is that this is a bad idea, and programming 
teachers spend some time discouraging their students from using such a 
strategy. However, many Chinese websites and email addresses are of this 
form, because it is the only form clear enough for a big fraction of the 
population.




This isn’t true of e.g.

   func pituus(s : String)

Even though “pituus” is Finnish, it’s still ASCII and everyone knows how to 
type that.


Avoiding “special characters” can be annoying in Latin-based languages, 
especially for beginners, and kids among them. Unicode’s (too slow) 
adoption has already eased the difficulty of writing a “Hello world” 
or a “What‘s your name” programme, but avoiding non-ASCII characters in 
identifiers can be a bit esoteric for kids with a native language full 
of them. (And by the way, several big French companies regularly send me 
mail with my first name mojibaked, while their software is presumably 
written by adults.)


[...]


  UAX #31 also manages (I suspect unintentionally?) to give a good example of a 
pair of Farsi identifiers that might be awkward to tell apart in certain fonts, 
namely نامهای and نامه‌ای; I think those are OK in monospaced fonts, where the 
join is reasonably wide, but at small point sizes in proportional fonts the 
difference in appearance is very subtle, particularly for a non-Arabic speaker.
In ASCII, identifiers with I, l, and 1 can be difficult to tell apart. 
And it is not an artificial problem: I’ve once had some difficulties 
with an automatically generated login which was do11y but tried to type 
dolly, despite my familiarity with ASCII. So I guess this problem is 
not specific to the ASCII vs non-ASCII debate.


Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Mark Davis ☕️ via Unicode
Got it, thanks.

Mark

On Thu, Jun 7, 2018 at 3:29 PM, Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Thu, 7 Jun 2018 10:42:46 +0200
> Mark Davis ☕️ via Unicode  wrote:
>
> > > The proposal also asks for identifiers to be treated as equivalent
> > > under
> > NFKC.
> >
> > The guidance in #31 may not be clear. It is not to replace
> > identifiers as typed in by the user by their NFKC equivalent. It is
> > rather to internally *identify* two identifiers (as typed in by the
> > user) as being the same. For example, Pascal had case-insensitive
> > identifiers. That means someone could type in
> >
> > myIdentifier = 3;
> > MyIdentifier = 4;
> >
> > And both of those would be references to the same internal entity. So
> > cases like SARA AM don't necessarily play into this.
>
> There has been a suggestion to not just restrict identifiers to NFKC
> equivalence classes (UAX31-R4), but to actually restrict them to NFKC
> form (UAX31-R6).  That is where the issue with SARA AM changes from a
> lurking issue to an active problem.  Others have realised that NFC
> makes more sense than NFKC for Rust.
>
> Richard.
>
>
>


Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Richard Wordingham via Unicode
On Thu, 7 Jun 2018 10:42:46 +0200
Mark Davis ☕️ via Unicode  wrote:

> > The proposal also asks for identifiers to be treated as equivalent
> > under  
> NFKC.
> 
> The guidance in #31 may not be clear. It is not to replace
> identifiers as typed in by the user by their NFKC equivalent. It is
> rather to internally *identify* two identifiers (as typed in by the
> user) as being the same. For example, Pascal had case-insensitive
> identifiers. That means someone could type in
> 
> myIdentifier = 3;
> MyIdentifier = 4;
> 
> And both of those would be references to the same internal entity. So
> cases like SARA AM don't necessarily play into this.

There has been a suggestion to not just restrict identifiers to NFKC
equivalence classes (UAX31-R4), but to actually restrict them to NFKC
form (UAX31-R6).  That is where the issue with SARA AM changes from a
lurking issue to an active problem.  Others have realised that NFC
makes more sense than NFKC for Rust.
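
A small illustration of the difference between the two (a Python sketch;
U+FB01 LATIN SMALL LIGATURE FI serves as the not-in-NFKC example):

    import unicodedata

    def uax31_r4_equivalent(a, b):
        # UAX31-R4: compare identifiers under NFKC equivalence,
        # while keeping whatever the user actually typed.
        return unicodedata.normalize('NFKC', a) == unicodedata.normalize('NFKC', b)

    def uax31_r6_acceptable(s):
        # UAX31-R6: only accept identifiers that are already in NFKC form.
        return s == unicodedata.normalize('NFKC', s)

    print(uax31_r4_equivalent('\ufb01le', 'file'))   # True: same identifier under R4
    print(uax31_r6_acceptable('\ufb01le'))           # False: rejected outright under R6
    print(uax31_r6_acceptable('file'))               # True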

Richard.




Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Richard Wordingham via Unicode
On Thu, 7 Jun 2018 13:32:13 +0200
Joan Montané via Unicode  wrote:

> 2018-06-04 21:49 GMT+02:00 Manish Goregaokar via Unicode <
> unicode@unicode.org>:  

> * Ŀ, LATIN CAPITAL LETTER L WITH MIDDLE DOT (U+013F), NFKC decomposes
> to LATIN CAPITAL LETTER L (U+004C) + MIDDLE DOT (U+00B7).
> * ŀ, LATIN SMALL LETTER L WITH MIDDLE DOT (U+0140), NFKC decomposes to
> LATIN SMALL LETTER L (U+006C) + MIDDLE DOT (U+00B7).

This is only a problem if U+00B7 is part of Rust's syntax.  U+00B7 has
the property (X)ID_Continue, so there is no formal problem.
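
A quick check of that point (a Python sketch; str.isidentifier follows
Python's UAX #31-based rule, so a leading letter is prepended to test
U+00B7 in the continue position):

    import unicodedata

    decomposed = unicodedata.normalize('NFKC', '\u013f')   # Ŀ
    print(decomposed)                                       # 'L·' (U+004C, U+00B7)
    print(('x' + '\u00b7').isidentifier())                  # True: U+00B7 is XID_Continue
    print('\u00b7'.isidentifier())                          # False: but it is not XID_Start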

Richard.



Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Philippe Verdy via Unicode
If you intend to allow all the standard orthography of common languages,
you would also need to support apostrophes and regular hyphens in
identifiers, including those from ASCII!

The Catalan middle dot is just a compact variant of the hyphen; it would
have been better as a diacritic, but the use of upper diacritics on the
letters l/L, with their high ascenders, caused problems when rendering with
compact line heights.

Polish chose to use a smart overstriking slash to avoid that problem;
another diacritic could have been used, such as the cedilla below, but the
middle dot was easier to add between the two handwritten "ll" (after
composing the rest of the word) without having to lift the drawing pen
from the surface.

The vertical placement of the "middle" dot is also largely variable when
handwritten. I have seen it drawn manually as a short stroke (horizontal or
slanted), which is easier to place by hand (the dot can easily fall on the
vertical strokes, and when "ll" is hand-drawn it frequently has the two
curls touching each other, so the dot may in fact fall in the middle of the
curl of the first l), and in that case it looks very much like the Polish
l with a stroke bar, or like an l followed by an apostrophe before the
second l.


2018-06-07 13:32 GMT+02:00 Joan Montané via Unicode :

>
>
> 2018-06-04 21:49 GMT+02:00 Manish Goregaokar via Unicode <
> unicode@unicode.org>:
>
>> Hi,
>>
>> The Rust community is considering
>>  adding non-ascii
>> identifiers, which follow UAX #31 
>> (XID_Start XID_Continue*, with tweaks). The proposal also asks for
>> identifiers to be treated as equivalent under NFKC.
>>
>> Are there any cases where this will lead to inconsistencies? I.e. can the
>> NFKC of a valid UAX 31 ident be invalid UAX 31?
>>
>
> Yes, such case exists, for instance in Latin alphabet and Catalan language.
>
> * Ŀ, LATIN CAPITAL LETTER L WITH MIDDLE DOT (U+013F), NFKC decomposes to
> LATIN CAPITAL LETTER L (U+004C) + MIDDLE DOT (U+00B7).
> * ŀ, LATIN SMALL LETTER L WITH MIDDLE DOT (U+0140), NFKC decomposes to
> LATIN SMALL LETTER L (U+006C) + MIDDLE DOT (U+00B7).
>
> Ŀ and ŀ are (were) used in the Catalan language for encoding the geminate L [1]
> when it is (was) encoded using 2 characters only. The preferred (and commonly
> used) encoding is currently the one with 3 characters. So, some adjustments
> are needed if you want to support Catalan-language identifiers [2]
>
> Yours,
> Joan Montané
>
>
> [1] https://en.wikipedia.org/wiki/Interpunct#Catalan
> [2] http://www.unicode.org/reports/tr31/#Specific_Character_Adjustments
>
>


Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Joan Montané via Unicode
2018-06-04 21:49 GMT+02:00 Manish Goregaokar via Unicode <
unicode@unicode.org>:

> Hi,
>
> The Rust community is considering
>  adding non-ascii
> identifiers, which follow UAX #31 
> (XID_Start XID_Continue*, with tweaks). The proposal also asks for
> identifiers to be treated as equivalent under NFKC.
>
> Are there any cases where this will lead to inconsistencies? I.e. can the
> NFKC of a valid UAX 31 ident be invalid UAX 31?
>

Yes, such a case exists, for instance in the Latin alphabet and the Catalan
language.

* Ŀ, LATIN CAPITAL LETTER L WITH MIDDLE DOT (U+013F), NFKC decomposes to
LATIN CAPITAL LETTER L (U+004C) + MIDDLE DOT (U+00B7).
* ŀ, LATIN SMALL LETTER L WITH MIDDLE DOT (U+0140), NFKC decomposes to
LATIN SMALL LETTER L (U+006C) + MIDDLE DOT (U+00B7).

Ŀ and ŀ are (were) used in the Catalan language for encoding the geminate L [1]
when it is (was) encoded using 2 characters only. The preferred (and commonly
used) encoding is currently the one with 3 characters. So, some adjustments
are needed if you want to support Catalan-language identifiers [2]
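
For what it's worth, a quick check of that mapping (a Python sketch; the
word paraŀlel / paral·lel is just an illustrative Catalan example, and
str.isidentifier is Python's UAX #31-based rule rather than Rust's profile):

    import unicodedata

    legacy    = 'para\u0140lel'      # paraŀlel, geminate L encoded with U+0140
    preferred = 'paral\u00b7lel'     # paral·lel, the three-character encoding

    print(unicodedata.normalize('NFKC', legacy) == preferred)   # True
    print(legacy.isidentifier(), preferred.isidentifier())      # True True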

Yours,
Joan Montané


[1] https://en.wikipedia.org/wiki/Interpunct#Catalan
[2] http://www.unicode.org/reports/tr31/#Specific_Character_Adjustments


Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Hans Åberg via Unicode
Now that the distinction is possible, it is recommended to do that.

My original question was directed to the OP, whether it is deliberate.

And they are confusables only to those not accustomed to it.


> On 7 Jun 2018, at 12:05, Philippe Verdy  wrote:
> 
> In my opinion the usual constant is most often shown as  "휋" (curly serifs, 
> slightly slanted) in mathematical articles and books (and in TeX), but rarely 
> as "π" (sans-serif).
> 
> There's a tradition of using handwriting for this symbol on blackboards (not 
> always with serifs, but still often slanted). A notation with the "π" symbol 
> uses a legacy round-trip mapping for old OEM charsets on low-resolution text 
> terminals, where it was distinguished from the more common Greek letter which 
> was enhanced for better readability once old low-resolution terminals were 
> replaced. "π" looks too much like a Hangul letter or a legacy box-drawing 
> character and is in fact difficult to recognize as the pi constant, but it may 
> still be found in some plain-text paragraphs of inline mathematical formulas 
> on screens (for programmers), at low resolution or with small font sizes, 
> where most text is in sans-serif Latin and not slanted/italicized and not 
> using a handwritten style.
> 
> If you think about writing a functional programming language using inline 
> formulas, then  the "π" symbol may be ok for the constant, and custom 
> identifiers for a function would use standard Greek letters (or other 
> standard scripts for human languages), or would use "pi" in Latin. You would 
> then write "pi(π)" in that inline formula. For a classic 2D mathematical 
> layout, you would use "pi(휋)" with distinctive but homogeneous styles for 
> custom variables/function names and for the classic mathematical constant.
> 
> As much as possible you will avoid mixing confusable letters/symbols in that 
> language.
> 
> Confusion is still possible if you use old texts mixing old Greek letters for 
> numerals: you would in that case avoid using the Greek letter pi for naming 
> your custom function, and would reserve the pi letter for the well-known 
> constant. But applying distinctive styles will enhance your formulas for 
> readability.




Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Philippe Verdy via Unicode
In my opinion the usual constant is most often shown as  "휋" (curly
serifs, slightly slanted) in mathematical articles and books (and in TeX),
but rarely as "π" (sans-serif).

There's a tradition of using handwriting for this symbol on blackboards (not
always with serifs, but still often slanted). A notation with the "π"
symbol uses a legacy round-trip mapping for old OEM charsets on
low-resolution text terminals, where it was distinguished from the more
common Greek letter which was enhanced for better readability once old
low-resolution terminals were replaced. "π" looks too much like a Hangul
letter or a legacy box-drawing character and is in fact difficult to recognize
as the pi constant, but it may still be found in some plain-text paragraphs
of inline mathematical formulas on screens (for programmers), at low
resolution or with small font sizes, where most text is in sans-serif Latin
and not slanted/italicized and not using a handwritten style.

If you think about writing a functional programming language using inline
formulas, then  the "π" symbol may be ok for the constant, and custom
identifiers for a function would use standard Greek letters (or other
standard scripts for human languages), or would use "pi" in Latin. You
would then write "pi(π)" in that inline formula. For a classic 2D
mathematical layout, you would use "pi(휋)" with distinctive but
homogeneous styles for custom variables/function names and for the classic
mathematical constant.

As much as possible you will avoid mixing confusable letters/symbols in that
language.

Confusion is still possible if you use old texts mixing old Greek letters
for numerals: you would in that case avoid using the Greek letter pi for
naming your custom function, and would reserve the pi letter for the
well-known constant. But applying distinctive styles will enhance your
formulas for readability.

2018-06-06 23:25 GMT+02:00 Hans Åberg via Unicode :

>
> > On 4 Jun 2018, at 21:49, Manish Goregaokar via Unicode <
> unicode@unicode.org> wrote:
> >
> > The Rust community is considering adding non-ascii identifiers, which
> follow UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also
> asks for identifiers to be treated as equivalent under NFKC.
>
> So, in this language, if one defines a projection function 휋 and the
> usual constant π, what is 휋(π) supposed to mean? - Just curious.
>
>
>
>


Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Mark Davis ☕️ via Unicode
> The proposal also asks for identifiers to be treated as equivalent under
NFKC.

The guidance in #31 may not be clear. It is not to replace identifiers as
typed in by the user by their NFKC equivalent. It is rather to internally
*identify* two identifiers (as typed in by the user) as being the same. For
example, Pascal had case-insensitive identifiers. That means someone could
type in

myIdentifier = 3;
MyIdentifier = 4;

And both of those would be references to the same internal entity. So cases
like SARA AM don't necessarily play into this.
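
In other words, the normalized form serves only as an internal lookup key.
A minimal sketch of such a table (Python; NFKC here, though the same shape
works for case folding, and the ligature example is only illustrative):

    import unicodedata

    class SymbolTable:
        """Stores identifiers as typed, but looks them up by their NFKC key."""
        def __init__(self):
            self._slots = {}    # NFKC key -> (spelling as first typed, value)

        def assign(self, name, value):
            key = unicodedata.normalize('NFKC', name)
            spelling = self._slots.get(key, (name, None))[0]
            self._slots[key] = (spelling, value)

        def lookup(self, name):
            return self._slots[unicodedata.normalize('NFKC', name)][1]

    table = SymbolTable()
    table.assign('\ufb01le_count', 3)        # typed with the U+FB01 ligature
    table.assign('file_count', 4)            # identified as the same entity
    print(table.lookup('\ufb01le_count'))    # 4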

> IMO the major issue with non-ASCII identifiers is not a technical one,
but rather that it runs the risk of fragmenting the developer community.

IMO, forcing everyone to stick to the limitations of ASCII for all
identifiers is unnecessary and often counterproductive.

First, programmers tend to think of "identifiers" as being specifically
"identifiers in programming languages" (and often "identifiers in
programming languages that I think are important". Identifiers may occur in
much broader contexts, often being much closer to end users (eg spreadsheet
formulae) or scripting languages, user identifiers, and so on.

Secondly, even with programming languages that are restricted to ASCII,
people can choose identifiers in code like the following, which would not
be obvious to many people.

var Stellenwert = Verteidigungsministerium_Konto.verarbeite(); // Asmus
könnte realistischere Beispiele vorschlagen

For a given project, and for programming languages (as opposed to more
user-facing languages) the language to be used for variables, functions,
comments, and so on will often be English, to allow for broader participation.
But that should be a choice of the people involved. There are clearly many
cases where that restriction is not optimal for a given project, where not
all of the developers (and prospective developers) are fluent in English,
but do share another common language. Think of all the in-house development
in countries and organizations around the world.

And finally, it's not like you hear of huge problems from Java or Swift or
other programming languages because they support non-ASCII identifiers.


Mark

On Thu, Jun 7, 2018 at 9:36 AM, Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Tue, 5 Jun 2018 01:37:47 +0100
> Richard Wordingham via Unicode  wrote:
>
> > The decomposed
> > form that looks the same is นํ้า .
> > The problem is that for sane results,  needs
> > special handling. This sequence is also often untypable - part of the
> > protection against Thai homographs.
>
> I've been misquoted on the Rust discussion topic - or the behaviour is
> more diverse than I was aware of.  On LibreOffice, with sequence
> checking not disabled, typing  disables the input by
> typing of U+0E49 or U+0E32 immediately afterwards.  Another mechanism
> is for typing another vowel to replace the U+0E4D.  The problem here is
> that in standard Thai, U+0E4D may not be followed by another vowel or
> tone mark, so Wing Thuk Thi (WTT) rules cut in.  (They're also quite
> good at preventing one from typing Northern Khmer.)  In LibreOffice,
> typing the NFKC form  is stopped at
> attempting to type U+0E4D, though one can get back to the original by
> typing U+0E33 instead.  To the rule checker, that is mission
> accomplished!
>
> Richard.
>
>


Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Hans Åberg via Unicode


> On 7 Jun 2018, at 03:56, Asmus Freytag via Unicode  
> wrote:
> 
> On 6/6/2018 2:25 PM, Hans Åberg via Unicode wrote:
>>> On 4 Jun 2018, at 21:49, Manish Goregaokar via Unicode 
>>>  wrote:
>>> 
>>> The Rust community is considering adding non-ascii identifiers, which 
>>> follow UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also 
>>> asks for identifiers to be treated as equivalent under NFKC.
>>> 
>> So, in this language, if one defines a projection function 휋 and the usual 
>> constant π, what is 휋(π) supposed to mean? - Just curious.
>> 
> In a language where one writes ASCII "pi" instead, what is pi(pi) supposed to 
> mean?

Indeed.





Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Alastair Houghton via Unicode
On 6 Jun 2018, at 17:50, Manish Goregaokar  wrote:
> 
> I think the recommendation to use ASCII as much as possible is implicit there.

It would be a very good idea to make it explicit. Even for English speakers, 
there may be a temptation to use characters that are hard to distinguish or 
hard to type on someone else’s keyboard; some thought needs to be given before 
choosing non-ASCII identifiers. Sometimes you might even choose to support 
multiple spellings of an API to avoid any problems. And in other cases it’s a 
good idea to remember that someone other than you might have to maintain your 
code in the future; that person might not speak the same language you do or use 
the same keyboard.

Kind regards,

Alastair.

--
http://alastairs-place.net



Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Richard Wordingham via Unicode
On Tue, 5 Jun 2018 01:37:47 +0100
Richard Wordingham via Unicode  wrote:

> The decomposed
> form that looks the same is นํ้า .
> The problem is that for sane results,  needs
> special handling. This sequence is also often untypable - part of the
> protection against Thai homographs.

I've been misquoted on the Rust discussion topic - or the behaviour is
more diverse than I was aware of.  On LibreOffice, with sequence
checking not disabled, typing  disables the input by
typing of U+0E49 or U+0E32 immediately afterwards.  Another mechanism
is for typing another vowel to replace the U+0E4D.  The problem here is
that in standard Thai, U+0E4D may not be followed by another vowel or
tone mark, so Wing Thuk Thi (WTT) rules cut in.  (They're also quite
good at preventing one from typing Northern Khmer.)  In LibreOffice,
typing the NFKC form  is stopped at
attempting to type U+0E4D, though one can get back to the original by
typing U+0E33 instead.  To the rule checker, that is mission
accomplished!

Richard.



Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Richard Wordingham via Unicode
On Mon, 4 Jun 2018 12:49:20 -0700
Manish Goregaokar via Unicode  wrote:

> Hi,
> 
> The Rust community is considering
>  adding non-ascii
> identifiers, which follow UAX #31
>  (XID_Start XID_Continue*, with
> tweaks). The proposal also asks for identifiers to be treated as
> equivalent under NFKC.
> 
> Are there any cases where this will lead to inconsistencies? I.e. can
> the NFKC of a valid UAX 31 ident be invalid UAX 31?
> 
> (In general, are there other problems folks see with this proposal?)

Confusable checking may need to be reviewed.  There are several cases
where, sometimes depending on the font, anagrams (differing
even after normalisation) can render the same. The examples I
know of are from SE Asia. The categories I know of are:

a) Swapping subscript letters - a big issue in the Myanmar script, but
Sanskrit grv- and gvr- can easily be rendered the same.  I don't know
how easily confusion arises by 'finger trouble'.

b) Vowel-subscript consonant and subscript consonant-vowel often look
the same in Khmer and Tai Tham.  The former spelling was supposedly
dropped in Khmer a century ago (the consonant ceasing to be subscript),
but lingered on in a few words and is acknowledged by Unicode but not by
the Microsoft font developer's guide.

c) Unresolved grammar.  In Thai minority languages, U+0E3A THAI
CHARACTER PHINTHU and a mark above (U+0E34 THAI CHARACTER SARA I, I
believe) can and do occur in either order, with no difference in
appearance or meaning.

The obvious humane solution is a brutal folding of the sequences.
(Using spell-checkers works wonders on normal text, but spell
checking code is tricky.) 

I actually suggested a character (U+1A54 TAI THAM LETTER GREAT SA) so
that folding 'ses' to 'sse' would not result in the 'ss' conjunct being
used; the conjunct is not used in 'ses'.

Richard.


Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Asmus Freytag via Unicode

  
  
On 6/6/2018 2:25 PM, Hans Åberg via Unicode wrote:

>> On 4 Jun 2018, at 21:49, Manish Goregaokar via Unicode  wrote:
>>
>> The Rust community is considering adding non-ascii identifiers, which
>> follow UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also
>> asks for identifiers to be treated as equivalent under NFKC.
>
> So, in this language, if one defines a projection function 휋 and the
> usual constant π, what is 휋(π) supposed to mean? - Just curious.

In a language where one writes ASCII "pi" instead, what is pi(pi) supposed
to mean?

A./

  



Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Hans Åberg via Unicode


> On 4 Jun 2018, at 21:49, Manish Goregaokar via Unicode  
> wrote:
> 
> The Rust community is considering adding non-ascii identifiers, which follow 
> UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also asks for 
> identifiers to be treated as equivalent under NFKC.

So, in this language, if one defines a projection function 휋 and the usual 
constant π, what is 휋(π) supposed to mean? - Just curious.





Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Henri Sivonen via Unicode
On Mon, Jun 4, 2018 at 10:49 PM, Manish Goregaokar via Unicode
 wrote:
> The Rust community is considering adding non-ascii identifiers, which follow
> UAX #31 (XID_Start XID_Continue*, with tweaks).

UAX #31 is rather light on documenting its rationale.

I realize that XML is a different case from Rust considering how the
Rust compiler is something a programmer runs locally whereas XML
documents and XML processors, especially over time, are significantly
less coupled.

Still, the experience from XML and HTML suggests that, if non-ASCII is
to be allowed in identifiers at all, restricting the value space of
identifiers a priori easily ends up restricting too much. HTML went with
the approach of collecting everything up to the next ASCII code point
that's a delimiter in HTML (and a later check for names that are
eligible for Custom Element treatment that mainly achieves
compatibility with XML but no such check for what the parser can
actually put in the document tree) while keeping the actual vocabulary
to ASCII (except for Custom Elements whose seemingly arbitrary
restrictions are inherited from XML).

XML 1.0 codified for element and attribute names what then was the
understanding of the topic that UAX #31 now covers and made other
cases a hard failure. Later, it turned out that XML originally ruled
out too much and the whole mess that was XML 1.1 and XML 1.0 5th ed.
resulted from trying to relax the rules.

Considering that ruling out too much can be a problem later, but just
treating anything above ASCII as opaque hasn't caused trouble (that I
know of) for HTML other than compatibility issues with XML's stricter
stance, why should a programming language, if it opts to support
non-ASCII identifiers in an otherwise ASCII core syntax, implement the
complexity of UAX #31 instead of allowing everything above ASCII in
identifiers? In other words, what problem does making a programming
language conform to UAX #31 solve?

Allowing anything above ASCII will lead to some cases that obviously
don't make sense, such as declaring a function whose name is a
paragraph separator, but why is it important to prohibit that kind of
thing when prohibiting things risks prohibiting too much, as happened
with XML, and people just don't mint identifiers that aren't practical
to them? Is there some important badness prevention concern that
applies to programming languages more than it applies to HTML? The key
thing here in terms of considering if badness is _prevented_ isn't
what's valid HTML but what the parser can actually put in the DOM, and
the HTML parser can actually put any non-ASCII code point in the DOM
as an element or attribute name (after the initial ASCII code point).

(The above question is orthogonal to normalization. I do see the value
of normalizing identifiers to NFC or requiring them to be in NFC to
begin with. I'm inclined to consider NFKC as a bug in the Rust
proposal.)
-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Philippe Verdy via Unicode
It could be argued that "modern" languages could use unique identifiers for
their syntax or API independently of the name being rendered. The problem
is that translated names may collide in non-obvious ways and become
ambiguous.
We've already seen the problems it caused in Excel with its translated
function names in some spreadsheets (things being worse when the
spreadsheet itself does not contain a language identifier to indicate in
which language these identifiers are defined, so English-only installations
of Excel (without the MUI/LUI installed) cannot open or process correctly
the spreadsheets created in other languages).

In practice, ASCII-only or ISO 8859-1-only identifiers work relatively
well, but there's always a problem entering these identifiers; a solution
would be to allow identifiers to have an ASCII-only alias, even if those
aliases are not so friendly for the original authors. But I've not seen
any programming language or API that allows defining aliases for
identifiers with exactly the same semantics as the translated names that
non-English users would prefer to see and use. In C/C++ you may have
aliases, but this requires special support in the binary object or
library format to allow equivalent bindings and resolution.
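
As a rough sketch of what such a source-level alias could look like (in
Rust, with hypothetical names, assuming a compiler that already accepts
non-ASCII identifiers; this says nothing about the binary binding support
mentioned above for C/C++):

    // Sketch: expose a non-ASCII name plus an ASCII-only alias that
    // resolves to exactly the same item.
    mod mittaus {
        /// Length of a string, under a non-English (Swedish) name.
        pub fn längd(s: &str) -> usize {
            s.chars().count()
        }

        // ASCII alias: a plain re-export, same function, second name.
        pub use self::längd as length;
    }

    fn main() {
        assert_eq!(mittaus::längd("abc"), mittaus::length("abc"));
    }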

For programming languages that are close to the machine level
(assembly, C, C++), or for common libraries intended to be used worldwide,
these names are in most cases English-only or use an "augmented English"
with approximate transliterations of borrowed words (notably proper
names) or invented words (company names, trademarks, custom neologisms
specific to an app or service, and a lot of acronyms). These APIs and
languages tend to create their own "jargon" with their own definitions
(which may be translated in their documentation).
Programmer comments, however, are very frequently written in any language
or script, because they are not constrained by uniqueness, name
resolution, or binding mechanisms.
But newer scripting languages are now very liberal (notably
JavaScript/ECMAScript) and make it fairly easy to rebind names in order to
generate an "equivalent" library, except where the library needs to work
through reflection and introspection. Scripting languages designed for
user personalisation should, however, be user friendly, and may
reasonably be designed to work well only with the language of the initial
author for his own usage (but cooperation will then be limited on the
Internet, and if one wants to share his code, he will have to provide
some basic translation or transliteration).

Most system-level APIs (filesystem or I/O, multiprocessing/multithreading,
networking) and data format options are specified using English terms only
(or near-English). The various IDEs, however, can make this language more
friendly by providing documentation searches, contextual helpers in the
editor itself, hinting popups, or various "machine learning" tools
(including "natural language" query wizards to help create and document the
technical language using the English-like jargon).

Most programming languages, however, do not define a lot of reserved
keywords (in English), and there's rarely a need to translate them
(though I've seen several programming languages that do translate them
into a few well-known languages, notably languages designed for children
or for learning programming). Some of these languages do not use a
plain-text syntax at all but graphic diagrams with symbols, arrows and
boxes, and programmers navigate the graphic layout or rearrange it to fit
new items or remove/combine them. An "advanced" view can then present
this layout in plain text using partly translated terms; this is easier
if there's a clear syntactic separation between custom identifiers
created by users (not translated) and the core keywords of the language.
Generally this separation uses quotation marks around custom identifiers,
but it is not even needed everywhere for data-oriented syntaxes like
JSON, which does not need any "reserved" identifier and reserves only
some punctuation.

Anyway, all programming jobs require a basic proficiency in reading and
writing English, plus acquiring a common English-like technical jargon
(that jargon does not have to be perfect English; it is used as a de
facto standard, and it evolves too fast to be correctly translated). This
jargon is still NOT normal English, and using it means that documentation
should still be adapted/translated into better English for native English
readers. If you look at some well-known projects in China, you'll see
that many projects are documented and supported only in Chinese, by
programmers who have a very limited knowledge of English (so their usage
of English in the technical jargon they create is linguistically
incorrect, but still correct for the technical needs), and to
translate/adapt these programs to other languages, Chinese is the source
of all translations, and must be present in all translation files to

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Alastair Houghton via Unicode
On 5 Jun 2018, at 07:09, Martin J. Dürst via Unicode  
wrote:
> 
> Hello Rebecca,
> 
> On 2018/06/05 12:43, Rebecca T via Unicode wrote:
> 
>> Something I’d love to see is translated keywords; shouldn’t be hard with a
>> line in the cargo.toml for a rudimentary lookup. Again, I’m of the opinion
>> that an imperfect implementation is better than no attempt. I remember
>> reading an article about a professor who translated the keywords in...
>> maybe it was Python? And found their students were much more engaged with
>> the material. Anecdotal, of course, but it’s stuck with me.
> 
> It would be good to have a reference for this. I can certainly see the point. 
> But on the other hand, I have also heard that using keywords in a foreign 
> language makes it clear that there may be a difference between the everyday 
> use of the word and the specific formal meaning in the programming language. 
> Then, there's also the problem that just translating keywords may work for 
> languages with the same sentence structure, but not for languages with a 
> completely different sentence structure. On top of that, keywords are just a 
> start; class/function/method names in libraries would have to be translated, 
> too, which would be much more work (especially if one wants to do a good job).

ALGOL68 was apparently localised (the standard explicitly supported that; it 
wasn’t an extension but rather something explicitly encouraged).  AppleScript 
was also designed to be (French and Japanese syntaxes were defined), and I have 
an inkling that someone once told me that at least one translation had actually 
shipped, though the translated variants are now deprecated as far as I’m aware.

Translated keywords are in some ways better than allowing non-ASCII 
identifiers, because they’re typically amenable to machine translation (indeed, 
in AppleScript, the scripts are not usually saved in ASCII anyway, but IIRC as 
a set of Apple Event Descriptors, so the “language” is just a matter for 
rendering to the user), which means that they don’t suffer from the problem of 
community fragmentation that non-ASCII identifiers *could* cause.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Alastair Houghton via Unicode
On 4 Jun 2018, at 20:49, Manish Goregaokar via Unicode  
wrote:
> 
> The Rust community is considering adding non-ascii identifiers, which follow 
> UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also asks for 
> identifiers to be treated as equivalent under NFKC.
> 
> Are there any cases where this will lead to inconsistencies? I.e. can the 
> NFKC of a valid UAX 31 ident be invalid UAX 31?
> 
> (In general, are there other problems folks see with this proposal?)

IMO the major issue with non-ASCII identifiers is not a technical one, but 
rather that it runs the risk of fragmenting the developer community.  Everyone 
can *type* ASCII and everyone can read Latin characters (for reasonably wide 
values of “everyone”, at any rate… most computer users aren’t going to have a 
problem).  Not everyone can type Hangul, Chinese or Arabic (for instance), and 
there is no good fix or workaround for this.

Note that this is orthogonal to issues such as which language identifiers or 
comments are written in (indeed, there’s no problem with comments written in 
any script you please); the problem is that e.g. given a function

  func الطول(s : String)

it isn’t obvious to a non-Arabic speaking user how to enter الطول in order to 
call it.  This isn’t true of e.g.

  func pituus(s : String)

Even though “pituus” is Finnish, it’s still ASCII and everyone knows how to 
type that.

Copy and paste is not always a good solution here, I might add; in bidi text in 
particular, copy and paste can have confusing results (and results that vary 
depending on the editor being used).  There is also the issue of additional 
confusions that might be introduced; even if you stick to Latin scripts, this 
could be a problem sometimes (e.g. at small sizes, it’s hard to distinguish ă 
and ǎ or ȩ and ę), and of course there are Cyrillic and Greek characters that 
are indistinguishable from their Latin counterparts in most fonts.  UAX #31 
also manages (I suspect unintentionally?) to give a good example of a pair of 
Farsi identifiers that might be awkward to tell apart in certain fonts, namely 
نامهای and نامه‌ای; I think those are OK in monospaced fonts, where the join is 
reasonably wide, but at small point sizes in proportional fonts the difference 
in appearance is very subtle, particularly for a non-Arabic speaker.
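
For the record, that pair differs only by a single invisible character,
U+200C ZERO WIDTH NON-JOINER, as a quick check makes explicit:

    // Sketch: the pair of Farsi identifiers cited in UAX #31 differ only
    // by an invisible U+200C ZERO WIDTH NON-JOINER.
    fn main() {
        // noon, alef, meem, heh, ZWNJ, alef, farsi yeh
        let with_zwnj = "\u{646}\u{627}\u{645}\u{647}\u{200C}\u{627}\u{6CC}";
        // the same letters without the ZWNJ
        let without_zwnj = "\u{646}\u{627}\u{645}\u{647}\u{627}\u{6CC}";
        assert_ne!(with_zwnj, without_zwnj); // distinct identifiers
        assert_eq!(with_zwnj.chars().count(), without_zwnj.chars().count() + 1);
    }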

You could avoid *some* of these issues by restricting the allowable scripts 
somehow (e.g. requiring that an identifier that had Latin characters could not 
also contain Cyrillic and so on) or perhaps by establishing additional 
canonical equivalences between similar looking characters (so that e.g. while a 
and а - or, more radically, ă and ǎ - might be different characters, you might 
nevertheless regard them as the same for symbol lookup).  It might be worth 
looking at UTR #36 and maybe UTR #39, not so much from a security standpoint, 
but more because those documents already have to deal with the problem of 
confusables.
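
A very crude sketch of the first option, using rough block ranges rather
than the real Script property or the UTS #39 machinery, just to show the
shape such a restriction could take:

    // Sketch: reject identifiers that mix Latin with Greek or Cyrillic.
    // Uses rough block ranges only; a real implementation would use the
    // Unicode Script property and the mixed-script rules in UTS #39.
    #[derive(Clone, Copy, PartialEq)]
    enum Repertoire { Latin, Greek, Cyrillic, Other }

    fn repertoire(c: char) -> Repertoire {
        match c as u32 {
            0x0041..=0x005A | 0x0061..=0x007A | 0x00C0..=0x024F => Repertoire::Latin,
            0x0370..=0x03FF | 0x1F00..=0x1FFF => Repertoire::Greek,
            0x0400..=0x04FF | 0x0500..=0x052F => Repertoire::Cyrillic,
            _ => Repertoire::Other,
        }
    }

    fn is_single_script(ident: &str) -> bool {
        let mut seen = None;
        for c in ident.chars() {
            let r = repertoire(c);
            if r == Repertoire::Other {
                continue; // digits, marks, other scripts: ignored here
            }
            match seen {
                None => seen = Some(r),
                Some(prev) if prev != r => return false,
                _ => {}
            }
        }
        true
    }

    // is_single_script("paypal") == true
    // is_single_script("pаypal") == false  (second letter is Cyrillic а, U+0430)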

You could also recommend that people stick to ASCII unless there’s a good 
reason to do otherwise (and note that using non-ASCII characters might impact 
on their ability to collaborate with teams in other countries).

None of this is necessarily a reason *not* to support non-ASCII identifiers, 
but it *is* something to be cautious about.  Right now, most programming 
languages operate as a lingua franca, with code written by a wide range of 
people, not all of whom speak English, but all of whom can collaborate together 
to a greater or lesser degree by virtue of the fact that they all understand 
and can write code.  Going down this particular rabbit hole risks changing 
that, and not for the better, and IMO it’s important to understand that when 
considering whether the trade-off of being able to use non-ASCII characters in 
identifiers is genuinely worth it.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-05 Thread Martin J. Dürst via Unicode

Hello Rebecca,

On 2018/06/05 12:43, Rebecca T via Unicode wrote:

> Something I’d love to see is translated keywords; shouldn’t be hard with a
> line in the cargo.toml for a rudimentary lookup. Again, I’m of the opinion
> that an imperfect implementation is better than no attempt. I remember
> reading an article about a professor who translated the keywords in...
> maybe it was Python? And found their students were much more engaged with
> the material. Anecdotal, of course, but it’s stuck with me.


It would be good to have a reference for this. I can certainly see the 
point. But on the other hand, I have also heard that using keywords in a 
foreign language makes it clear that there may be a difference between 
the everyday use of the word and the specific formal meaning in the 
programming language. Then, there's also the problem that just 
translating keywords may work for languages with the same sentence 
structure, but not for languages with a completely different sentence 
structure. On top of that, keywords are just a start; 
class/function/method names in libraries would have to be translated, 
too, which would be much more work (especially if one wants to do a good 
job).


Regards,   Martin.


Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-04 Thread Rebecca T via Unicode
I think that the benefits of inclusion from allowing non-ASCII identifiers
far outweigh any corner cases this might cause. (Although ironing out and
analyzing those is of course important, I don’t think they should be
obstacles for implementing this kind of thing.)

Something I’d love to see is translated keywords; shouldn’t be hard with a
line in the cargo.toml for a rudimentary lookup. Again, I’m of the opinion
that an imperfect implementation is better than no attempt. I remember
reading an article about a professor who translated the keywords in...
maybe it was Python? And found their students were much more engaged with
the material. Anecdotal, of course, but it’s stuck with me.

On Mon, Jun 4, 2018 at 3:53 PM Manish Goregaokar via Unicode <
unicode@unicode.org> wrote:

> Hi,
>
> The Rust community is considering
>  adding non-ascii
> identifiers, which follow UAX #31 
> (XID_Start XID_Continue*, with tweaks). The proposal also asks for
> identifiers to be treated as equivalent under NFKC.
>
> Are there any cases where this will lead to inconsistencies? I.e. can the
> NFKC of a valid UAX 31 ident be invalid UAX 31?
>
> (In general, are there other problems folks see with this proposal?)
>
>
> Thanks,
> -Manish
>


Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-04 Thread Richard Wordingham via Unicode
On Mon, 4 Jun 2018 12:49:20 -0700
Manish Goregaokar via Unicode  wrote:

> Hi,
> 
> The Rust community is considering
>  adding non-ascii
> identifiers, which follow UAX #31
>  (XID_Start XID_Continue*, with
> tweaks). The proposal also asks for identifiers to be treated as
> equivalent under NFKC.

> (In general, are there other problems folks see with this proposal?)

There's the usual lurking issue that the Thai word for water, น้ำ
<U+0E19, U+0E49, U+0E33>, is unacceptable and often untypable and uncopiable
when converted to NFKC น้ํา <U+0E19, U+0E49, U+0E4D, U+0E32>.  The decomposed
form that looks the same is นํ้า <U+0E19, U+0E4D, U+0E49, U+0E32>.  The problem
is that for sane results, <U+0E4D, U+0E32> needs special handling.
This sequence is also often untypable - part of the protection against
Thai homographs.
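
A small sketch of what that conversion does, assuming the third-party
unicode-normalization crate:

    // Sketch: U+0E33 THAI CHARACTER SARA AM has a compatibility
    // decomposition to <U+0E4D NIKHAHIT, U+0E32 SARA AA>, so NFKC
    // rewrites the word for water.
    use unicode_normalization::UnicodeNormalization;

    fn main() {
        let water = "\u{0E19}\u{0E49}\u{0E33}";            // น้ำ
        let nfkc: String = water.nfkc().collect();
        let expected = "\u{0E19}\u{0E49}\u{0E4D}\u{0E32}"; // น้ํา
        assert_eq!(nfkc, expected); // SARA AM is gone after NFKC
    }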

Richard.



Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-04 Thread Manish Goregaokar via Unicode
Oh, looks like UAX 31 has info on how to be closed under NFKC

http://www.unicode.org/reports/tr31/#NFKC_Modifications
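
To answer the question in the subject line mechanically, one can scan for
characters that are allowed by the XID properties but whose NFKC form
contains a character that is not; a rough sketch, assuming the
third-party unicode-xid and unicode-normalization crates:

    // Sketch: find code points that are XID_Continue but whose NFKC
    // normalization contains a character that is not XID_Continue.
    use unicode_normalization::UnicodeNormalization;
    use unicode_xid::UnicodeXID;

    fn main() {
        for u in 0..=0x10FFFFu32 {
            let Some(c) = char::from_u32(u) else { continue }; // skips surrogates
            if !c.is_xid_continue() {
                continue;
            }
            let leaks = std::iter::once(c)
                .nfkc()
                .any(|d| !d.is_xid_continue());
            if leaks {
                println!("U+{:04X} is XID_Continue but its NFKC form is not", u);
            }
        }
    }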

-Manish


On Mon, Jun 4, 2018 at 12:49 PM Manish Goregaokar 
wrote:

> Hi,
>
> The Rust community is considering
>  adding non-ascii
> identifiers, which follow UAX #31 
> (XID_Start XID_Continue*, with tweaks). The proposal also asks for
> identifiers to be treated as equivalent under NFKC.
>
> Are there any cases where this will lead to inconsistencies? I.e. can the
> NFKC of a valid UAX 31 ident be invalid UAX 31?
>
> (In general, are there other problems folks see with this proposal?)
>
>
> Thanks,
> -Manish
>