Remember that UTF-8 is ASCII-compatible for the first 128 code points (0x00-0x7F, i.e. 7 bits). So:
0x00 to 0x1F are the control codes from ASCII
0x7F is DEL (not sure why that wasn't in the first set, but there you go).
And 0x80 to 0xFF are the rest of the non-ASCII bytes (128-255), which you
have to be able to use in order to do UTF-8. But frankly, I'm not sure what
a regex means with regard to bytes. If I had to guess, I'd pull it apart this
way (which is almost what's in the footnote):
First: MUTF8 means a multibyte UTF-8 encoded, NFC-normalized Unicode character.
However, Unicode doesn't quite use "characters", but rather "code points", so
that means: any Unicode code point >= 128 (0x80).
`([a-zA-Z0-9_]|{MUTF8})([^\x00-\x1F/\x7F-\xFF]|{MUTF8})*`
The first character has to be:
([a-zA-Z0-9_]|{MUTF8}): an ASCII letter, digit, or underscore, OR any other
code point >= 128 (0x80).
All the other characters have to be:
Any code point other than \x00-\x1F, '/', or \x7F-\xFF, OR (via {MUTF8}) any code point >= 128.
Which is an odd way to define it: the code points \x7F-\xFF are valid
Unicode, so you're excluding them and then (via {MUTF8}) allowing them
again. Strange.
I suspect that this started with the original pre-Unicode definition, the
UTF-8 part was added later, and the result is an odd mixture. In particular,
there is really no reason to treat single-byte and multibyte UTF-8 code points
separately; that's just odd.
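As a sanity check, here is a sketch (my own, not from the spec) of how that regex behaves if you apply it at the byte level, approximating {MUTF8} as "any byte >= 0x80":

```python
import re

# Sketch only: approximate {MUTF8} as "any byte >= 0x80", i.e. the lead
# and continuation bytes of a multibyte UTF-8 sequence.
MUTF8 = rb"[\x80-\xFF]"
FIRST = rb"(?:[a-zA-Z0-9_]|" + MUTF8 + rb")"
REST = rb"(?:[^\x00-\x1F/\x7F-\xFF]|" + MUTF8 + rb")"
NAME = re.compile(rb"\A" + FIRST + REST + rb"*\Z")

def is_valid_name(name: str) -> bool:
    """True if the UTF-8 encoding of name matches the spec's regex."""
    return NAME.match(name.encode("utf-8")) is not None

print(is_valid_name("temp_2m"))      # True
print(is_valid_name("température"))  # True: é encodes as bytes >= 0x80
print(is_valid_name("bad/name"))     # False: '/' is excluded
print(is_valid_name("\x01name"))     # False: control code
```

Note how the exclusion of \x7F-\xFF in the second character class is immediately undone by the {MUTF8} alternative, which is exactly the oddity described above.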
I think I'd write this as:
Names are UTF-8 encoded.
The first character can be any of these code points:
```
x30 - x39 (digits: 0-9)
x41 - x5a (upper case letters: A-Z)
x61 - x7a (lower case letters: a-z)
x5f (underscore)
>= x80
The rest can be any code point other than:
```
x00 - x1F (control codes)
x7F (DEL)
```
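A minimal sketch of that simplified rule in Python (my paraphrase, not the spec's wording):

```python
def valid_name(name: str) -> bool:
    """Sketch of the simplified rule above: first code point is a digit,
    ASCII letter, underscore, or any code point >= 0x80; the rest can be
    anything except x00-x1F (control codes) and x7F (DEL)."""
    if not name:
        return False
    first = name[0]
    if not ((first.isascii() and (first.isalnum() or first == "_"))
            or ord(first) >= 0x80):
        return False
    return all(ord(c) > 0x1F and ord(c) != 0x7F for c in name[1:])

print(valid_name("π_r2"))      # True: π is a code point >= 0x80
print(valid_name("#temp"))     # False: '#' is not allowed first
print(valid_name("temp\x7f"))  # False: DEL is excluded everywhere
```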
However, there is a key missing piece: a number of Unicode code points are used
for control characters, whitespace, and probably other things unsuitable for
names. Which may be why they used the term "character". But it would be better
if they had clearly defined what's allowed and what's not. For instance,
Python 3 uses these categories:
(https://docs.python.org/3/reference/lexical_analysis.html#identifiers)
Lu - uppercase letters
Ll - lowercase letters
Lt - titlecase letters
Lm - modifier letters
Lo - other letters
Nl - letter numbers
I have no idea if those are defined by the Unicode consortium anywhere. But it
would be good for netcdf (and or CF) to define it for themselves.
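For what it's worth, Python's standard unicodedata module can report these two-letter category codes, so it's easy to check any given character:

```python
import unicodedata

# Print the Unicode General Category code for a few sample characters.
for ch in ["A", "a", "ǅ", "π", "_", "💩"]:
    print(ch, unicodedata.category(ch))
# A  -> Lu (uppercase letter)
# a  -> Ll (lowercase letter)
# ǅ  -> Lt (titlecase letter)
# π  -> Ll (lowercase letter)
# _  -> Pc (connector punctuation)
# 💩 -> So (other symbol; not a letter at all)
```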
I will say that it's kind of nifty to be able to do (in Python):
```
In [16]: import math
In [17]: π = math.pi
In [18]: area = π * r**2
```
But I'm not sure I need to be able to name a variable 💩 -- which Python
will not allow. Does the netcdf spec allow it?
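You can check Python's answer directly with str.isidentifier(), which applies the category-based rule above:

```python
# str.isidentifier() implements Python's category-based identifier rule.
print("π".isidentifier())   # True: π is a letter (category Ll)
print("💩".isidentifier())  # False: 💩 is a symbol (category So)
```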
--
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599804396