I think there is some confusion here.

First, the regex discussion is only about the physical byte layout of the 
netCDF classic file format. I would suggest focusing entirely on netCDF-4 
files instead.

Second, I think CF should not concern itself with encodings and byte order 
at all. Leave that to netCDF-4/HDF5 and just work at the character level. 
And yes, Unicode has code points, but it also has a concept of characters (see 
[here](https://en.wikipedia.org/wiki/Unicode#Architecture_and_terminology)).
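
To make the distinction concrete, here is a small Python sketch of my own (it is 
not taken from the CF or netCDF documents). It shows one visible character that 
can be written as either one or two code points, and whose UTF-8 byte length 
differs again. The helper name `describe` is made up for this illustration.

```python
import unicodedata

# The same visible character "é", written two ways.
decomposed = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT: two code points
composed = unicodedata.normalize("NFC", decomposed)  # U+00E9: one code point


def describe(text):
    """Return (number of code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


print(describe(decomposed))
print(describe(composed))
```

Working at the character level means a convention never has to talk about the 
second number, and normalization takes care of the first.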

Third, looking at the regex in question
```
([a-zA-Z0-9_]|{MUTF8})([^\x00-\x1F/\x7F-\xFF]|{MUTF8})*
```
note that it is only an explanatory comment. That aside, the overwhelmingly 
likely way to read it, given the "|" alternatives, is as either
```
([a-zA-Z0-9_])([^\x00-\x1F/\x7F-\xFF])*
```
i.e. an ASCII string that starts with a letter, digit, or underscore, stays 
within the 7-bit ASCII range, contains no control characters, and excludes "/" 
everywhere, or
```
({MUTF8})({MUTF8})*
```
i.e. *any* Unicode string encoded as normalized UTF-8.
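
A rough Python sketch of these two readings, for anyone who wants to try names 
against them. This is only an illustration under assumptions: `{MUTF8}` is not 
defined in this thread, so I take it to mean one non-ASCII (multi-byte in UTF-8) 
character in NFC form, and the function names `is_ascii_name` and 
`is_mutf8_name` are hypothetical.

```python
import re
import unicodedata

# Reading 1: the pure ASCII branch, checked against the raw bytes so that
# the \x7F-\xFF range means byte values, as in the original regex.
ASCII_NAME = re.compile(rb"[a-zA-Z0-9_][^\x00-\x1F/\x7F-\xFF]*")


def is_ascii_name(name: str) -> bool:
    """True if the UTF-8 bytes of name match the ASCII-only reading."""
    return ASCII_NAME.fullmatch(name.encode("utf-8")) is not None


def is_mutf8_name(name: str) -> bool:
    """True if name matches ({MUTF8})({MUTF8})* under the assumption that
    {MUTF8} is a single non-ASCII character in NFC-normalized form."""
    return (
        len(name) > 0
        and all(ord(ch) > 0x7F for ch in name)
        and unicodedata.normalize("NFC", name) == name
    )


for candidate in ["air_temperature", "air/temperature", "温度", "e\u0301"]:
    print(repr(candidate), is_ascii_name(candidate), is_mutf8_name(candidate))
```

The ASCII check runs on bytes and the other one on characters, which is exactly 
the mix of levels I would rather CF stayed out of.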

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599957114
