Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-18 Thread Klaus Zimmermann
Ah yes, I see what you mean, and you are right. Staying within UTF-8, 
"multi-byte" here doesn't refer to the possibility of several bytes 
encoding one code point, but to the code points that actually take more than 
one byte, thus excluding the one-byte code points, which are exactly the 128 
ASCII characters. Then they allow specific ASCII characters back in.
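
To make the one-byte versus multi-byte split concrete, here is a small Python sketch. The helper names are made up for illustration and are not taken from the NUG; it only counts how many bytes UTF-8 needs for a single code point:

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 uses for one code point."""
    return len(ch.encode("utf-8"))


def is_multibyte(ch: str) -> bool:
    """True for code points that need more than one byte in UTF-8."""
    return utf8_length(ch) > 1


# ASCII stays at one byte; everything from U+0080 upward takes two to four.
for ch in ["A", "/", "\u00b0", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")
```

If I read the thread right, `{MUTF8}` is meant to cover exactly the code points for which `is_multibyte` is true.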

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-600545623

This list forwards relevant notifications from Github.  It is distinct from 
cf-metad...@cgd.ucar.edu, although if you do nothing, a subscription to the 
UCAR list will result in a subscription to this list.
To unsubscribe from this list only, send a message to 
cf-metadata-unsubscribe-requ...@listserv.llnl.gov.

Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-17 Thread Chris Barker
> UTF-8 is only an encoding, so we should just say "unicode" for strings.

We could do that if and only if netcdf itself was clear about how Unicode is 
encoded in files. Which it is for variable names, though not so sure it is 
anywhere else.

But even so, once the encoding has been specified, then yes, talking about 
Unicode makes sense. 

Agreed, it's not for this discussion, but:

`MUTF8` is not quite (in that doc) "any unicode string encoded as normalized 
UTF-8", because I think they are specifically trying to exclude the ASCII 
subset, so they can handle that separately. I.e. characters that are excluded, 
like "/", are indeed unicode strings.

But it's a pretty contorted way to describe it -- but that's netcdf's problem 
:-)




-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-600128492


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-17 Thread Klaus Zimmermann
I agree and would go one small step further: UTF-8 is only an encoding, so we 
should just say "unicode" for strings. If we need to restrict that, say to 
disallow a leading underscore or to reserve a separator character like 
space in attributes right now, we should do so at the character level, possibly 
using categories as introduced by @ChrisBarker-NOAA above.

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-600067627


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-17 Thread JimBiardCics
@DocOtak @zklaus Sorry if I've pulled the discussion off track. The question of 
exactly why NUG worded things the way they did is intriguing, but I think Klaus 
is right that we shouldn't get wrapped around that particular axle in this 
issue — particularly if we are going to split encoding off into a different 
issue. I think the take-away is that our baseline is "sane utf-8 unicode" for 
attributes of type NC_STRING and ASCII for attributes of type NC_CHAR (those 
created with the C function nc_put_att_text).

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-600065075


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-17 Thread Klaus Zimmermann
I think there is some confusion here.

First, this whole regex stuff is only about the physical byte layout of the 
netcdf classic file format. In principle, I would suggest focusing entirely on 
netcdf4 files instead.

Second, I think CF should not concern itself with encodings and byte order 
stuff at all. Leave that to netcdf4/hdf5 and just work at the character level. 
And yes, unicode has code points, but also a concept of characters (see 
[here](https://en.wikipedia.org/wiki/Unicode#Architecture_and_terminology)).

Third, looking at the regex in question
```
([a-zA-Z0-9_]|{MUTF8})([^\x00-\x1F/\x7F-\xFF]|{MUTF8})*
```
notice that it is only an explanatory comment, but apart from that the 
overwhelmingly likely way to parse this, thanks to the "|" alternatives, is as 
either
```
([a-zA-Z0-9_])([^\x00-\x1F/\x7F-\xFF])*
```
i.e. an ascii string starting with a letter, digit, or underscore, limited to 
the first 128 byte values, without control characters and excluding "/" 
everywhere, or
```
({MUTF8})({MUTF8})*
```
i.e. *any* unicode string encoded as normalized UTF-8.

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599957114


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-16 Thread Chris Barker
Remember that utf-8 is ascii compatible for the first 128 code points (0-127, 
7 bits). So:

0x00 to 0x1F are the control codes from ASCII

0x7F is DEL (not sure why that wasn't in the first set, but there you go).

and 0x80 to 0xFF is the rest of the non-ascii bytes (128-255), which you 
have to be able to use in order to do utf-8. But frankly, I'm not sure what 
a regex means with regard to bytes. But if I had to guess, I'd pull it apart 
this way (which is almost what's in the footnote):

First: MUTF8 means "multibyte UTF-8 encoded, NFC-normalized Unicode character".
However, Unicode doesn't quite use "characters", but rather "code points", so 
that means any Unicode code point >= 128 (0x80).

`([a-zA-Z0-9_]|{MUTF8})([^\x00-\x1F/\x7F-\xFF]|{MUTF8})*`

The first character has to be:
([a-zA-Z0-9_]|{MUTF8}): ASCII letter, number or underscore OR any code 
point of 128 or above

All the other characters have to be:
Any code point other than \x00-\x1F, "/", and \x7F-\xFF, OR any code point of 
128 or above.

Which is an odd way to define it, as the codepoints \x7F-\xFF are valid 
Unicode, so you're kind of excluding them, and then allowing them again. 
Strange.

I suspect that this started with the original pre-Unicode definition, and they 
added the UTF8 part, and got an odd mixture. In particular, there is really no 
reason to treat the single byte or multibyte UTF codepoints separately; that's 
just odd.

I think I'd write this as:

Names are UTF-8 encoded.
The first letter can be any of these codepoints:
```
x30 - x39 (digits: 0-9)
x41 - x5a (upper case letters: A-Z)
x61 - x7a (lower case letters: a-z)
x5f (underscore)
>= x80
```

The rest can be any code point other than:
```
\x00-\x1F or \x7F
```
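
As a rough sketch of the rule proposed above, working on code points in Python: this is only my reading of that proposal, not the official netCDF name check, and the function name is hypothetical. Note that, unlike the NUG regex, the rule as written does not exclude "/".

```python
def is_valid_name_sketch(name: str) -> bool:
    """Check a name against the rule proposed above (not the netCDF rule)."""
    if not name:
        return False
    first = ord(name[0])
    first_ok = (
        0x30 <= first <= 0x39      # digits 0-9
        or 0x41 <= first <= 0x5A   # upper case letters A-Z
        or 0x61 <= first <= 0x7A   # lower case letters a-z
        or first == 0x5F           # underscore
        or first >= 0x80           # any non-ASCII code point
    )
    if not first_ok:
        return False
    # Remaining code points: anything except the C0 controls and DEL.
    return all(not (ord(ch) <= 0x1F or ord(ch) == 0x7F) for ch in name[1:])


for candidate in ["temperature", "_private", "\u03c0_area", " temp", "te\tmp"]:
    print(repr(candidate), is_valid_name_sketch(candidate))
```

The check still lets through non-ASCII whitespace and control code points, which is exactly the missing piece discussed next.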
However, there is a key missing piece: a number of Unicode code points are used 
for control characters and whitespace, and probably other things unsuitable for 
names. Which may be why they used the term "character". But it would be better 
if they had clearly defined what's allowed and what's not. For instance, 
Python3 uses these categories 
(https://docs.python.org/3/reference/lexical_analysis.html#identifiers):
Lu - uppercase letters
Ll - lowercase letters
Lt - titlecase letters
Lm - modifier letters
Lo - other letters
Nl - letter numbers

I have no idea if those are defined by the Unicode consortium anywhere. But it 
would be good for netcdf (and/or CF) to define it for themselves.

I will say that it's kind of nifty to be able to do (in Python):

```
In [17]: π = math.pi
In [18]: area = π * r**2
```
But I'm not sure I need to be able to assign a variable to  -- which Python 
will not allow, but does the netcdf spec allow it?
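
For anyone who wants to poke at this, the two-letter codes listed above are what Python's standard `unicodedata.category` reports, and `str.isidentifier` applies Python's own identifier rule. The sketch below is only illustrative: `could_start_name` is a hypothetical helper, and the emoji U+1F600 is my stand-in for a symbol Python rejects, not necessarily the character meant above.

```python
import unicodedata

# Categories listed above for the first character of a Python identifier
# (Python also allows the underscore and a few extra code points).
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def could_start_name(ch: str) -> bool:
    """Hypothetical check: underscore or one of the listed categories."""
    return ch == "_" or unicodedata.category(ch) in START_CATEGORIES


for ch in ["A", "\u03c0", "\u2160", "0", " ", "\u00b0", "\U0001F600"]:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} "
          f"start={could_start_name(ch)} identifier={ch.isidentifier()}")
```

A category-based rule like this would be one way for netcdf or CF to say what is allowed at the character level rather than in terms of bytes.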










-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599804396


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-16 Thread JimBiardCics
I missed the regex. Yep, that's what it says. 0x7F is the "del" char, so it's 
non-printing. I think the characters from 0xC0 - 0xFF are out because they 
would all be interpreted in UTF-8 as signaling the start of a multi-byte 
character. 0x80 - 0xBF can all be interpreted as trailing elements of a 
multibyte character, so I guess it's a bad plan to have one lying around loose. 
[This Wikipedia article](https://en.wikipedia.org/wiki/UTF-8) was informative.
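
A quick way to see those byte ranges is to classify each byte of an encoded string by its high bits. This is a sketch with a made-up function name. It uses the coarse split described above, although a few of the "lead" values (0xC0, 0xC1, 0xF5 to 0xFF) never occur in valid UTF-8.

```python
def classify_utf8_byte(b: int) -> str:
    """Coarse role of a single byte value within UTF-8."""
    if b <= 0x7F:
        return "ascii"          # 0xxxxxxx: a complete one-byte character
    if b <= 0xBF:
        return "continuation"   # 10xxxxxx: trailing byte of a sequence
    return "lead"               # 11xxxxxx: first byte of a sequence


encoded = "10\u00b0".encode("utf-8")   # b'10\xc2\xb0'
for b in encoded:
    print(f"0x{b:02X} {classify_utf8_byte(b)}")
```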

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599791736


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-16 Thread Andrew Barna
@JimBiardCics It's the "not match" group in that regex that is doing it 
`([^\x00-\x1F/\x7F-\xFF]|{MUTF8})`, at least, I'm pretty sure that is what is 
going on. I rarely use regex myself, so I could be wrong, but I'm quite sure 
that the `^` is "not match".
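
For what it's worth, the negated group can be tried out directly with Python's `re` module on single bytes. This only exercises the `[^...]` part of the pattern (the `{MUTF8}` alternative is not modelled), and `byte_allowed` is a hypothetical helper name:

```python
import re

# Inside [...], a leading ^ inverts the set: match any byte NOT listed.
ALLOWED_ASCII = re.compile(rb"[^\x00-\x1F/\x7F-\xFF]")


def byte_allowed(b: int) -> bool:
    """True if the single byte value b is matched by the negated class."""
    return ALLOWED_ASCII.fullmatch(bytes([b])) is not None


allowed = [b for b in range(256) if byte_allowed(b)]
print(bytes(allowed).decode("ascii"))   # printable ASCII without "/"
```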

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599756995


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-16 Thread JimBiardCics
@DocOtak I couldn't find the direct restriction on the 0x80 to 0xFF characters. 
Is this a side effect of utf-8 using the high bit to signal multibyte 
characters? Or is it a more general prohibition against using the characters in 
latin-1 that fall in that range?

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599755236


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-16 Thread Chris Barker
Thanks! Yup -- then attributes really do need to be UTF-8 and the STRING type 
(for text) only.

I suppose they don't ALL HAVE to be the STRING type, but the ones that might 
contain variable names should be.

After all, any software that doesn't support the STRING type probably doesn't 
support Unicode variable names, either ...



-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599728983


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-16 Thread Andrew Barna
Additionally, the netcdf standard itself has support for UTF-8 variable names, 
requires them to be 
[NFC](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms), and 
specifically excludes bytes [0x00 to 0x1F and 0x7F to 
0xFF](https://www.unidata.ucar.edu/software/netcdf/docs/file_format_specifications.html)
 (see the "name" part of that document). 

I think this matters because at least one of the [standard attributes 
](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#ancillary-data)
 needs to be able to refer to variable names. Basically, allowing anything 
other than UTF-8, especially things that allow bytes 0x7F to 0xFF (like the 
ISO-8859 series encodings do), would probably cause actual problems.
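
To illustrate why NFC and the choice of encoding matter when an attribute has to name a variable, here is a small Python sketch. The example name and the `same_name` helper are invented for illustration:

```python
import unicodedata

name_precomposed = "caf\u00e9"   # 'é' as the single code point U+00E9
name_decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT


def same_name(a: str, b: str) -> bool:
    """Hypothetical comparison: equal once both are NFC-normalized."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Without normalization the two spellings do not compare equal.
print(name_precomposed == name_decomposed, same_name(name_precomposed, name_decomposed))

# The same name written as ISO-8859-1 uses the byte 0xE9, which is not
# valid UTF-8 on its own.
latin1_bytes = name_precomposed.encode("latin-1")
print(latin1_bytes, name_precomposed.encode("utf-8"))
```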

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599690108


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-16 Thread Chris Barker
> Also be aware that LATIN-1 is not compatible with UTF-8 with code points 
> above 127

Indeed. Which is why it should be clear that you should NOT put utf-8 in a CHAR 
array :-) We could say ASCII only for CHAR, but I'm not sure there is a good 
reason to be that restrictive.

It may be an implementation detail of the Python encodings, but at least there, 
latin-1 can decode ANY string of bytes (other than the null byte) without error, 
and write it out again with no changes. So if consuming code uses the latin-1 
encoding for all CHAR arrays, it may get garbage for the non-ascii bytes, but 
it won't raise an error, or mangle the data if it is written back out.
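
The round-trip property is easy to check in Python. This is a sketch and `decodes_as` is a hypothetical helper. In Python itself the round trip also holds for the null byte; whether a C-style reader stops at it is a separate matter.

```python
all_bytes = bytes(range(256))

as_text = all_bytes.decode("latin-1")       # every byte maps to U+0000..U+00FF
round_tripped = as_text.encode("latin-1")   # and maps straight back


def decodes_as(data: bytes, encoding: str) -> bool:
    """True if data can be decoded with the given encoding without error."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


print(round_tripped == all_bytes)
print(decodes_as(all_bytes, "latin-1"), decodes_as(all_bytes, "utf-8"))
```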

> the netcdf python library will force the use of strings for netcdf4 files if 
> it sees unicode points outside of ASCII.

which is the right thing to do, and compatible with this proposal, I think. 
(hmm, unless latin-1 is allowed). But you could probably send a latin-1 encoded 
bytes object in, yes?

Anyway, if we codify this, and the netCDF4 lib (or any other) can't support it, 
it can be fixed. And yes, I am volunteering to do a PR for a fix to 
netCDF4-python.


-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599681959


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-16 Thread Chris Barker
> we should, in this specific issue, propose making changes to the CF document 
> to make it clear that CHAR attributes must be ASCII or latin-1 and STRING 
> attributes should be unicode/utf-8

+1 on that.

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599662220


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-16 Thread JimBiardCics
@ChrisBarker-NOAA 
* The sub-issues haven't been split off.
* This issue is about attributes only. 
* The difference between CHAR and STRING for an attribute is, essentially, 
which type you pick when you create the attribute. With CHAR you can't specify 
arrays for the value (because it already is a CHAR array) and with STRING you 
can.

Assuming we do spin off sub-issues related to encoding and string array 
attributes, I agree fully that we should, in this specific issue, propose 
making changes to the CF document to make it clear that CHAR attributes must be 
ASCII or latin-1 and STRING attributes should be unicode/utf-8.

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599603557


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-15 Thread Chris Barker
> My original observation was that we can absolutely split off some of these 
> issues.

Agreed. 

Have these been started? I can't find them if they have.

There is also the question of what to do with CHAR types -- the same as STRING?

And what about encoding of CHAR and STRING variables? I can't find anything 
about that in the current CF document, so it doesn't seem to be settled.

Maybe this should go in a new issue, but for now, I had a (not well formed) 
thought:

CHAR variables and attributes should only be encoded in a 1-byte per character 
ascii compatible encoding: e.g. ascii, latin-1

STRING variables and attributes should only be encoded in utf-8 (of which ascii 
is a subset)

My justification is that there will be little software in the wild that 
supports Unicode, but does not support String. Setting this standard will make 
it less likely that older software that assumes a 1-byte-per-character text 
representation will get handed something it can't deal with. And the string 
type is better suited to Unicode anyway, as the "length" of a string is less 
well defined.
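
Here is a small Python sketch of what "less well defined" means in practice. The `fits_char_buffer` helper and the buffer size are invented for illustration and are not part of any netCDF API:

```python
text = "10\u00b0C"                      # '10°C'

n_code_points = len(text)               # what Python's str counts
utf8_bytes = text.encode("utf-8")       # the degree sign takes two bytes
latin1_bytes = text.encode("latin-1")   # one byte per character


def fits_char_buffer(s: str, size: int, encoding: str) -> bool:
    """Hypothetical check: does the encoded text fit a fixed-size CHAR buffer?"""
    return len(s.encode(encoding)) <= size


# Cutting UTF-8 at an arbitrary byte offset can split a character in two.
truncated = utf8_bytes[:3]              # b'10\xc2' is no longer valid UTF-8

print(n_code_points, len(utf8_bytes), len(latin1_bytes))
```

With a 1-byte-per-character encoding the character count and the byte count agree, so a fixed-length CHAR array behaves the way older software expects.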


-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599262170


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-15 Thread JimBiardCics
@ChrisBarker-NOAA My original observation was that we can absolutely split off 
some of these issues. I see two issues being peeled off from the base issue.
* Define a convention for attributes with multiple strings (string array 
attributes).
* Determine what to do (or not do) regarding different encodings in string 
attributes.

I think you've made a strong case for starting out by specifying ASCII and 
Unicode / UTF-8 as the only valid contents for string attributes, with one of 
the two spinoff issues addressing the question of broadening the options.

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599211543


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-14 Thread Chris Barker
One small additional note about Python and Unicode:

The post Jim pointed us to:

https://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/

Is now six years old -- and many of the issues brought up have been addressed.

And the author of that post has another post on the dangers of referring 
back to such older opinions:

https://lucumr.pocoo.org/2016/11/5/be-careful-about-what-you-dislike/

Another issue with that discussion is that it's written from the perspective of 
what some folks in the community are calling "byte slingers": Those that write 
libraries and the like that deal with binary data and protocols. And the fact 
is that Python3's String model is NOT as well suited to those use cases. But it 
is massively better suited to most more "casual" use cases. In that post, he 
refers to "beginners", but it's not beginners, it's anyone that does not 
understand the subtleties of binary data, encodings, and the like. Which is 
most of us "scientific programmers".

Bringing this back to CF: For CF, ideally we would choose an approach that is 
well suited to the "Normal scientific programmer", and leave the 
encoding/decoding to the libraries. And have confidence that the "byte 
slingers" will correctly write the libraries to match the standard, and make 
things "just work" for most users.




-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599107553


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-13 Thread Chris Barker
@JimBiardCics wrote:

Actually, I know a LOT more about Python than I do about netcdf, HDF, or CF. 
And I'm afraid you have it a bit confused. This is kind of off-topic, but for 
clarity's sake:

> Python 3 is not the same as python 2.

Very True, and a source of much confusion.

> In Python 2 there were two types — str (ASCII) and unicode (by default UTF-8).

Almost right: there were two types:

`str`: which was a single byte per character of unknown encoding -- essentially 
a wrapped char* -- usually ascii compatible, often latin-1, but not if you were 
Japanese, for instance. It was also used as a holder of arbitrary binary data: 
see numpy's "fromstring()" methods, or reading a binary file. Much like how 
char* is used in C.

`unicode`: which was unicode text -- stored internally in UCS-2 or UCS-4 
depending on how Python was compiled (I know, really?!?!) It could be encoded / 
decoded in various encodings for IO and interaction with other systems.

> In Python 3 there is only str, and by default it holds UTF-8 unicode

Almost right: the Py3 `str` type is indeed Unicode, but it holds a sequence of 
Unicode code points, which are internally stored in a dynamic encoding 
depending on the content of the string (really! a very cool optimization, 
actually, if you have only ascii text, it will use only one byte per char 
https://rushter.com/blog/python-strings-and-memory/ ). But all that is hidden 
from the user. To the user, a `str` is a sequence of characters from the entire 
Unicode set, very simply. 
 
(Unicode is particularly weird in that one "code point" is not always one 
character, or "grapheme" to accommodate languages with more complex systems of 
combining characters, etc, but I digress..)
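
As a quick illustration of that digression, here is a sketch in which the example strings are my own. Python's `len` counts code points, not what a reader would call characters:

```python
accented = "e\u0301"            # 'e' + COMBINING ACUTE ACCENT, drawn as one 'é'
flag = "\U0001F1E9\U0001F1EA"   # two regional indicators, usually drawn as one flag

for label, s in [("accented", accented), ("flag", flag)]:
    print(label, "code points:", len(s), "utf-8 bytes:", len(s.encode("utf-8")))
```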

And there are still two types -- in Python3 there is the "bytes" type, which is 
actually very similar to the old python2 string type -- but intended to hold 
arbitrary binary data, rather than text. But text is binary data, so it can 
still hold that. In fact, if you encode a string, you get a bytes object:

```
In [13]: s  
Out[13]: 'some text'

In [14]: b = s.encode("ascii")  

In [15]: b  
Out[15]: b'some text'
```
Note the little 'b' before the quote. In that case, they look almost identical, 
as I encoded in ASCII. But what if I had some non-ASCII text?:

```
In [18]: s = "temp = 10\u00B0"  

In [19]: s  
Out[19]: 'temp = 10°'

In [20]: b = s.encode("ascii")
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-20-...> in <module>
----> 1 b = s.encode("ascii")

UnicodeEncodeError: 'ascii' codec can't encode character '\xb0' in position 9: ordinal not in range(128)
```

oops, can't do that -- the degree symbol is not part of ASCII. But I can do 
utf-8:

```
In [21]: b = s.encode("utf-8")  

In [22]: b  
Out[22]: b'temp = 10\xc2\xb0'
```
which now displays the byte values, escaping the non-ascii ones. So that bytes 
object is what would get written to a netcdf file, or any other binary file.

And Python can just as easily encode that text in any supported encoding, of 
which there are many:

```
In [28]: s.encode("utf-16") 
Out[28]: b'\xff\xfet\x00e\x00m\x00p\x00 \x00=\x00 \x001\x000\x00\xb0\x00'
```
But please don't use that one!

So anyway, the relevant point here is that there is NOTHING special about utf-8 
as far as Python is concerned. And in fact, Python is well suited to handle 
pretty much any encoding folks choose to use -- but it doesn't help a bit with 
the fundamental problem that you need to know what the encoding of your data is 
in order to use it. And if Python software (like any other) is going to 
write a netcdf file with non-ascii text in it, it needs to know what encoding 
to use.
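
To show the "you need to know the encoding" problem from the reading side, here is a sketch that decodes the same bytes under a few candidate encodings. `decode_with` is a hypothetical helper, and the byte string is the UTF-8 output from the session above:

```python
raw = b"temp = 10\xc2\xb0"   # UTF-8 bytes for 'temp = 10°'


def decode_with(data: bytes, encoding: str) -> str:
    """Decode with one candidate encoding, or report that it failed."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return "<cannot decode>"


for enc in ["utf-8", "latin-1", "ascii"]:
    print(enc, "->", repr(decode_with(raw, enc)))
```

Only one of the three gives back the original text; latin-1 succeeds but returns the wrong characters, and nothing in the bytes themselves says which guess is right.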

The other complication that has come up here is that, IIUC, the netCDF4 Python 
library (a wrapper around the C libnetcdf) makes no distinction between 
the netcdf types CHAR and STRING (don't quote me on that), but that's a 
decision of the library authors, not a limitation of Python.

Actually, it does seem to give the user some control:

https://unidata.github.io/netcdf4-python/netCDF4/index.html#netCDF4.chartostring

Note that utf-8 is the default, but you can do whatever you want.

In any case, the Python libraries can be made to work with anything reasonable 
CF decides, even if I have to write the PRs myself :-)

Sorry to be so long winded, but this IS confusing stuff!



Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-13 Thread Chris Barker
I'm getting double messages -- I think we may have a feedback loop between 
GitHub and the list.

But anyway:

> Hmmm. Chris, I think you are implying a problem that does not exist.


I hope that's true, Sorry if I stirred up confusion.

But I was responding to a comment about ASCII vs UTF-8, so 

I also picked this up in email, so was unsure of the context. I've now gone and 
re-read the issue, and I'm a bit confused about what's still on the table.

But way back, someone wrote:
" two issues: the use of strings, and the encoding. These can be decided 
separately, can't they?"

and there was another one: arrays of strings vs whitespace separated strings.

(I'm also not completely clear about the difference between a char* and a 
string anyway. Either way, it's a bunch of bytes that need to be interpreted)

So I'll just talk about encoding here. A few points:

(I know you all know most of this, and most of it has been stated in this 
thread, but to put it all in one place...)

* Encodings are a nightmare: any place that a pile of bytes could be in more 
than one encoding is a pain in the a$$ for any client software -- think about 
the earlier days of html!

* Being able to use non-ASCII characters is important and unavoidable. We can 
certainly restrict CF names to ASCII, but it's simply not an option for 
variables or attributes. (I don't think anyone is suggesting that anyway) and 
Unicode is the obvious way to support that.

So that leaves one open question: what encoding(s) are allowed for a CF 
compliant file?

I'm going to be direct here:

THERE IS NO REASON TO ALLOW MORE THAN ONE ENCODING

It only leads to pain. Period. End of story. If there is one allowed encoding, 
then all CF compliant software will have to be able to encode/decode that 
encoding. But ONLY that one! If we allow multiple encodings, then to be fully 
compliant, all software would have to encode/decode a wide range of encodings, 
and there would have to be a way to specify the encoding. So all software would 
have to be more complex, and there would be a lot more room for error.

If there is only one encoding allowed, then there are really only two options: 

UCS-4: because it handles all of Unicode and is always the same number of 
bytes per code point. A lot more like the old char* days. However, no one wants 
to waste all that disk space, so that leaves:

UTF-8: which is ASCII compatible, handles all of Unicode, and has been almost 
universally adopted in most internet exchange formats (those that are sane 
enough to specify a single encoding :-) )

It is also friendly to older software that uses null-terminated char* and the 
like, so even old code will probably not break, even if it does misinterpret 
the non-ascii bytes. And old software that writes plain ascii will also work 
fine, as ascii IS utf-8.
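
The trade-offs above can be checked in a few lines of Python. This is a sketch: the sample text is arbitrary, and I'm using Python's `utf-32-le` codec as a stand-in for UCS-4.

```python
text = "temp = 10\u00b0"          # 10 code points, one of them non-ASCII

utf8 = text.encode("utf-8")       # variable width: 1 byte for ASCII, 2 for '°'
ucs4 = text.encode("utf-32-le")   # fixed 4 bytes per code point, no byte-order mark

ascii_only = "temp = 10"

print(len(utf8), len(ucs4))                                        # disk space
print(ascii_only.encode("ascii") == ascii_only.encode("utf-8"))    # ASCII is UTF-8
print(b"\x00" in utf8, b"\x00" in ucs4)                            # embedded nulls
```

UTF-8 text without a NUL character contains no zero bytes, which is why null-terminated char* code keeps working, while the fixed-width form is full of them.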

All that's a long way of saying:

CF should specify UTF-8 as the only correct encoding for all text: char or 
string. With possibly some extra restrictions to ASCII in some contexts.

If that had already been decided, then sorry for the noise :-)

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599001824


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-13 Thread JimBiardCics
Chris,

Python 3 is not the same as Python 2. In Python 2 there were two types: str 
(ASCII) and unicode (by default UTF-8). In Python 3 there is only str, and by 
default it holds UTF-8 unicode (there's lots of subtlety that I'm glossing over 
here, but this is what it boils down to). It bit me recently, so I'm sensitive 
to it.

https://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/
https://docs.python.org/3/howto/unicode.html
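
As a rough illustration of some of that subtlety (standard Python 3 only; the sample string is made up): a `str` is a sequence of Unicode code points, and UTF-8 only comes into play when the text is encoded to `bytes`.

```python
text = "caf\u00e9"                # 4 code points; the last one is U+00E9
encoded = text.encode("utf-8")    # bytes object; U+00E9 becomes two bytes

n_code_points = len(text)         # counts code points
n_bytes = len(encoded)            # counts encoded bytes

# Decoding with the same encoding round-trips the text.
round_trip = encoded.decode("utf-8")

print(n_code_points, n_bytes, round_trip == text)
```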

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-598958361


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-13 Thread Dave Allured
Hmmm.  Chris, I think you are implying a problem that does not exist.  I do not 
think CF has ever restricted the use of UTF-8 in free text within attributes.  
I suspect there are many UTF-8 attribute examples in the wild, though I do not 
have one up my sleeve right now.  Please correct me if I'm wrong.

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-598956209


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-13 Thread JimBiardCics
@Dave-Allured I approve of your proposal. I think we pretty much have no choice 
but to allow UTF-8 as a baseline to start with, but there clearly are larger 
issues to be resolved. (I say "no choice" because, for example, constraining to 
ASCII in Python 3 is a bit complicated.)
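
For illustration only, here is a sketch of what an explicit ASCII check could look like in Python 3. `is_ascii_only` is a hypothetical helper, not part of any CF or netCDF library. The check itself is short; since a Python 3 `str` accepts any Unicode text, a restriction like this would have to be applied deliberately wherever attribute text is written.

```python
def is_ascii_only(text: str) -> bool:
    """Return True if every code point in text is in the 7-bit ASCII range."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# Made-up attribute values, for illustration only.
print(is_ascii_only("degrees_north"))   # plain ASCII
print(is_ascii_only("\u00b0C"))         # starts with U+00B0, outside ASCII
```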

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-598914012


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-13 Thread Andrew Barna
@Dave-Allured That sounds OK to me. Whatever does get adopted should probably 
be pretty explicit about what is "not allowed".

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-598912616


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-13 Thread Dave Allured
@DocOtak, thanks for restarting this.  In light of past difficulties, I move to 
split the issue.

I think it would be possible to break out the more difficult parts of this 
topic into new and separate issues.  I suggest that this issue #141 be narrowed 
to only a single essential ingredient:  scalar string-type attributes as an 
alternative for traditional character-type attributes.

Can we agree to move the following to new Github issues, and focus for now only 
on legalizing scalar string-type attributes?

* Array string-type attributes.
* All discussion of character sets, UTF-8 or otherwise.

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-598909876


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-13 Thread Andrew Barna
So now that [string variables have 
landed](https://github.com/cf-convention/cf-conventions/pull/140), I want to 
bring some attention to this issue again. Some updates I've learned about:

* Because strings are allowed in variables, some are assuming that string 
attributes are allowed by CF 1.8, especially because [Appendix 
A](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#attribute-appendix)
 uses the word "string" instead of... text? whatever `nc_get_att_text()` 
returns...
* This is now being used as a "CF-1.8" conforming recommendation from 
[PODAAC](https://podaac.jpl.nasa.gov/) for [pinniped 
data](https://github.com/oceandatainterop/nc-eTAG/blob/master/nc-eTAG_Archival_Template.cdl).
* I personally see use of string arrays as having extreme utility for lists of 
people's names (mostly ACDD attrs), flag definitions, and anything else that is 
currently "white space separated" in the CF standard.
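
To illustrate the last point with a sketch in plain Python (the names are invented and no netCDF library is involved): a whitespace-separated text attribute cannot tell where one multi-word entry ends and the next begins, while an array of strings keeps each entry intact.

```python
# Hypothetical attribute values, for illustration only.
creators_as_text = "Jane Doe John Q. Public"
creators_as_array = ["Jane Doe", "John Q. Public"]

# Splitting the text form on whitespace yields individual words, not names.
tokens = creators_as_text.split()

print(len(tokens), len(creators_as_array))
```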


-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-598851406
