On Wed, Jun 13, 2018 at 2:49 PM Mark Davis ☕️ <span> wrote:
>
> > That is, why is conforming to UAX #31 worth the risk of prohibiting the use
> > of characters that some users might want to use?
>
> One could parse for certain sequences, putting characters into a number of
> broad categories. Very approximately:
>
> junk ~= [[:cn:][:cs:][:co:]]+
> whitespace ~= [[:z:][:c:]-junk]+
> syntax ~= [[:s:][:p:]] // broadly speaking, including both the language
> syntax & user-named operators
> identifiers ~= [all-else]+
>
> UAX #31 specifies several different kinds of identifiers, and takes roughly
> that approach for
> http://unicode.org/reports/tr31/#Immutable_Identifier_Syntax, although the
> focus there is on immutability.
>
> So an implementation could choose to follow that course, rather than the more
> narrowly defined identifiers in
> http://unicode.org/reports/tr31/#Default_Identifier_Syntax. Alternatively,
> one can conform to the Default Identifiers but declare a profile that expands
> the allowable characters. One could take a Swiftian approach, for example...
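
The four broad categories quoted above can be approximated with the general categories exposed by Python's standard `unicodedata` module. This is only a rough sketch of that "very approximately" classification, not an implementation of UAX #31; the names `broad_class` and `split_runs` are made up for illustration, and the results depend on the Unicode version bundled with the Python build.

```python
import unicodedata
from itertools import groupby


def broad_class(ch: str) -> str:
    """Map one code point to one of the four rough classes from the quoted text."""
    cat = unicodedata.category(ch)
    if cat in ("Cn", "Cs", "Co"):   # unassigned, surrogate, private use
        return "junk"
    if cat[0] in "ZC":              # separators and remaining "other" (Cc, Cf)
        return "whitespace"
    if cat[0] in "SP":              # symbols and punctuation
        return "syntax"
    return "identifier"             # all else: letters, marks, numbers


def split_runs(text: str):
    """Split text into (class, run) pairs; syntax characters stay one per token."""
    out = []
    for cls, run in groupby(text, key=broad_class):
        chars = list(run)
        if cls == "syntax":
            out.extend((cls, c) for c in chars)
        else:
            out.append((cls, "".join(chars)))
    return out


if __name__ == "__main__":
    for sample in ("ab.cd();", "αβ⊗γδ"):
        print(sample, split_runs(sample))
```

Under this rough scheme the ASCII underscore (general category Pc) lands in "syntax", which is one reason a real language would need a more deliberate profile than the sketch provides.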
Thank you and sorry about my slow reply. Why is excluding junk important?

> On Fri, Jun 8, 2018 at 11:07 AM, Henri Sivonen via Unicode <span> wrote:
>>
>> On Wed, Jun 6, 2018 at 2:55 PM, Henri Sivonen <span> wrote:
>> > Considering that ruling out too much can be a problem later, but just
>> > treating anything above ASCII as opaque hasn't caused trouble (that I
>> > know of) for HTML other than compatibility issues with XML's stricter
>> > stance, why should a programming language, if it opts to support
>> > non-ASCII identifiers in an otherwise ASCII core syntax, implement the
>> > complexity of UAX #31 instead of allowing everything above ASCII in
>> > identifiers? In other words, what problem does making a programming
>> > language conform to UAX #31 solve?
>>
>> After refreshing my memory of XML history, I realize that mentioning
>> XML does not helpfully illustrate my question despite the mention of
>> XML 1.0 5th ed. in UAX #31 itself. My apologies for that. Please
>> ignore the XML part.
>>
>> Trying to rephrase my question more clearly:
>>
>> Let's assume that we are designing a computer-parseable syntax where
>> tokens consisting of user-chosen characters can't occur next to each
>> other and, instead, always have some syntax-reserved characters
>> between them. That is, I'm talking about syntaxes that look like this
>> (could be e.g. Java):
>>
>>   ab.cd();
>>
>> Here, ab and cd are tokens with user-chosen characters whereas space
>> (the indent), period, parenthesis and the semicolon are
>> syntax-reserved. We know that ab and cd are distinct tokens, because
>> there is a period between them, and we know the opening parenthesis
>> ends the cd token.
>>
>> To illustrate what I'm explicitly _not_ talking about, I'm not talking
>> about a syntax like this:
>>
>> αβ⊗γδ
>>
>> Here αβ and γδ are user-named variable names and ⊗ is a user-named
>> operator and the distinction between different kinds of user-named
>> tokens has to be known somehow in order to be able to tell that there
>> are three distinct tokens: αβ, ⊗, and γδ.
>>
>> My question is:
>>
>> When designing a syntax where tokens with the user-chosen characters
>> can't occur next to each other without some syntax-reserved characters
>> between them, what advantages are there from limiting the user-chosen
>> characters according to UAX #31 as opposed to treating any character
>> that is not a syntax-reserved character as a character that can occur
>> in user-named tokens?
>>
>> I understand that taking the latter approach allows users to mint
>> tokens that on some aesthetic measure don't make sense (e.g. minting
>> tokens that consist of glyphless code points), but why is it important
>> to prescribe that this is prohibited as opposed to just letting users
>> choose not to mint tokens that are inconvenient for them to work with
>> given the behavior that their plain text editor gives to various
>> characters? That is, why is conforming to UAX #31 worth the risk of
>> prohibiting the use of characters that some users might want to use?
>> The introduction of XID after ID and the introduction of Extended
>> Hashtag Identifiers after XID is indicative of over-restriction having
>> been a problem.
>>
>> Limiting user-minted tokens to UAX #31 does not appear to be necessary
>> for security purposes considering that HTML and CSS exist in a
>> particularly adversarial environment and get away with taking the
>> approach that any character that isn't a syntax-reserved character is
>> collected as part of a user-minted identifier. (Informally, both treat
>> non-ASCII characters the same as an ASCII underscore.
>> HTML even treats non-whitespace, non-U+0000 ASCII controls that way.)
>>
>> --
>> Henri Sivonen
>> hsivo...@hsivonen.fi
>> https://hsivonen.fi/
>>
>
--
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/</span></span></span>
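
For readers who want to experiment with the alternative raised in the quoted question (treat every character that is not syntax-reserved as part of a user-named token), here is a minimal sketch. The reserved set is an assumption chosen for illustration (ASCII whitespace plus ASCII punctuation other than the underscore), and `RESERVED` and `tokenize` are hypothetical names; this is not how HTML or CSS tokenizers are actually specified.

```python
import string

# Assumed reserved set: ASCII whitespace plus ASCII punctuation except "_".
# Everything else, including all non-ASCII characters, is name material.
RESERVED = set(string.whitespace) | (set(string.punctuation) - {"_"})


def tokenize(source: str):
    """Return (kind, text) pairs; whitespace separates tokens but is dropped."""
    tokens = []
    current = []  # characters of the name being collected
    for ch in source:
        if ch in RESERVED:
            if current:
                tokens.append(("name", "".join(current)))
                current = []
            if not ch.isspace():
                tokens.append(("reserved", ch))
        else:
            current.append(ch)
    if current:
        tokens.append(("name", "".join(current)))
    return tokens


if __name__ == "__main__":
    print(tokenize("  ab.cd();"))
    print(tokenize("αβ.γδ();"))
    print(tokenize("αβ⊗γδ"))  # one name: ⊗ is not reserved in this sketch
```

With this reserved set, `αβ⊗γδ` comes out as a single name, and a name containing a glyphless code point such as U+200B is accepted as well, which is the trade-off the question is about.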