Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-13 Thread Chris Barker
@JimBiardCics wrote:

Actually, I know a LOT more about Python than I do about netcdf, HDF, or CF. 
And I'm afraid you have it a bit confused. This is kind of off-topic, but for 
clarity's sake:

> Python 3 is not the same as python 2.

Very true, and a source of much confusion.

> In Python 2 there were two types — str (ASCII) and unicode (by default UTF-8).

Almost right: there were two types:

`str`: which was a single byte per character of unknown encoding -- essentially 
a wrapped char* -- usually ascii compatible, often latin-1, but not if you were 
Japanese, for instance. It was also used as a holder of arbitrary binary data: 
see numpy's "fromstring()" methods, or reading a binary file. Much like how 
char* is used in C.

`unicode`: which was unicode text -- stored internally in UCS-2 or UCS-4 
depending on how Python was compiled (I know, really?!?!) It could be encoded / 
decoded in various encodings for IO and interaction with other systems.

> In Python 3 there is only str, and by default it holds UTF-8 unicode

Almost right: the Py3 `str` type is indeed Unicode, but it holds a sequence of 
Unicode code points, which are internally stored in a dynamic encoding 
depending on the content of the string (really! a very cool optimization, 
actually: if you have only ascii text, it will use only one byte per char; see 
https://rushter.com/blog/python-strings-and-memory/ ). But all that is hidden 
from the user. To the user, a `str` is a sequence of characters from the entire 
Unicode set, very simply. 
 
(Unicode is particularly weird in that one "code point" is not always one 
character, or "grapheme", to accommodate languages with more complex systems of 
combining characters, etc., but I digress...)
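
A minimal sketch of the code point vs. grapheme distinction, using only the standard library. The example character is chosen here for illustration and is not from the thread:

```python
import unicodedata

composed = "\u00e9"     # 'é' as one precomposed code point
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

# Both render as the same single character, but Python counts code points.
print(len(composed), len(decomposed))

# The two strings compare unequal until they are normalized to the same form.
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
```

So `len()` on a `str` counts code points, not what a reader would call characters.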

And there are still two types -- in Python3 there is the "bytes" type, which is 
actually very similar to the old python2 string type -- but intended to hold 
arbitrary binary data, rather than text. But text is binary data, so it can 
still hold that. In fact, if you encode a string, you get a bytes object:

```
In [13]: s  
Out[13]: 'some text'

In [14]: b = s.encode("ascii")  

In [15]: b  
Out[15]: b'some text'
```
Note the little 'b' before the quote. In that case, they look almost identical, 
as I encoded in ASCII. But what if I had some non-ASCII text?:

```
In [18]: s = "temp = 10\u00B0"  

In [19]: s  
Out[19]: 'temp = 10°'

In [20]: b = s.encode("ascii")  
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
 in 
----> 1 b = s.encode("ascii")

UnicodeEncodeError: 'ascii' codec can't encode character '\xb0' in position 9: 
ordinal not in range(128)
```

Oops, can't do that -- the degree symbol is not part of ASCII. But I can do 
utf-8:

```
In [21]: b = s.encode("utf-8")  

In [22]: b  
Out[22]: b'temp = 10\xc2\xb0'
```
which now displays the byte values, escaping the non-ascii ones. So that bytes 
object is what would get written to a netcdf file, or any other binary file.
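
As a small sketch of the round trip, reusing the same string as above: the `str` has one entry per code point, the encoded `bytes` has one entry per byte, and decoding with the same encoding recovers the original text.

```python
s = "temp = 10\u00B0"
b = s.encode("utf-8")

# 10 code points, but 11 bytes: the degree sign takes two bytes in UTF-8.
print(len(s), len(b))

# Decoding with the encoding that was used to write gives the text back.
round_tripped = b.decode("utf-8")
print(round_tripped == s)
```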

And Python can just as easily encode that text in any supported encoding, of 
which there are many:

```
In [28]: s.encode("utf-16") 
Out[28]: b'\xff\xfet\x00e\x00m\x00p\x00 \x00=\x00 \x001\x000\x00\xb0\x00'
```
But please don't use that one!

So anyway, the relevant point here is that there is NOTHING special about utf-8 
as far as Python is concerned. And in fact, Python is well suited to handle 
pretty much any encoding folks choose to use -- but it doesn't help a bit with 
the fundamental problem that you need to know what the encoding of your data is 
in order to use it. And if Python software (like any other) is going to 
write a netcdf file with non-ascii text in it, it needs to know what encoding 
to use.
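
To illustrate why the reader has to know the encoding, here is a sketch that decodes the same bytes two ways. The choice of latin-1 as the "wrong guess" is an assumption made for the example:

```python
text = "temp = 10\u00B0"
utf8_bytes = text.encode("utf-8")      # b'temp = 10\xc2\xb0'
latin1_bytes = text.encode("latin-1")  # b'temp = 10\xb0'

# Guessing latin-1 for UTF-8 bytes raises no error; it silently yields wrong text.
right = utf8_bytes.decode("utf-8")
wrong = utf8_bytes.decode("latin-1")
print(right)
print(wrong)

# Guessing UTF-8 for latin-1 bytes fails outright, because a lone 0xb0
# byte is not a valid UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)
```

Neither outcome is something client software can reliably detect and repair after the fact, which is the underlying problem.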

The other complication that has come up here is that, IIUC, the netCDF4 Python 
library (a wrapper around the C libnetcdf) makes no distinction between the 
netcdf types CHAR and STRING (don't quote me on that), but that's a decision 
of the library authors, not a limitation of Python.

Actually, it does seem to give the user some control:

https://unidata.github.io/netcdf4-python/netCDF4/index.html#netCDF4.chartostring

Note that utf-8 is the default, but you can do whatever you want.

In any case, the Python libraries can be made to work with anything reasonable 
CF decides, even if I have to write the PRs myself :-)

Sorry to be so long winded, but this IS confusing stuff!


-- 
You are receiving this 

Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-13 Thread Chris Barker
I'm getting double messages -- I think we may have a feedback loop between 
GitHub and the list.

But anyway:

> Hmmm. Chris, I think you are implying a problem that does not exist.


I hope that's true. Sorry if I stirred up confusion.

But I was responding to a comment about ASCII vs UTF-8, so 

I also picked this up in email, so was unsure of the context. I've now gone and 
re-read the issue, and I'm a bit confused about what's still on the table.

But way back, someone wrote:
" two issues: the use of strings, and the encoding. These can be decided 
separately, can't they?"

and there was another one: arrays of strings vs whitespace separated strings.

(I'm also not completely clear about the difference between a char* and a 
string anyway. Either way, it's a bunch of bytes that need to be interpreted)

So I'll just talk about encoding here. A few points:

(I know you all know most of this, and most of it has been stated in this 
thread, but to put it all in one place...)

* Encodings are a nightmare: any place that a pile of bytes could be in more 
than one encoding is a pain in the a$$ for any client software -- think about 
the earlier days of html!

* Being able to use non-ASCII characters is important and unavoidable. We can 
certainly restrict CF names to ASCII, but it's simply not an option for 
variables or attributes. (I don't think anyone is suggesting that anyway) and 
Unicode is the obvious way to support that.

So that leaves one open question: what encoding(s) are allowed for a CF 
compliant file?

I'm going to be direct here:

THERE IS NO REASON TO ALLOW MORE THAN ONE ENCODING

It only leads to pain. Period. End of story. If there is one allowed encoding, 
then all CF compliant software will have to be able to encode/decode that 
encoding. But ONLY that one! If we allow multiple encodings, then to be fully 
compliant, all software would have to encode/decode a wide range of encodings, 
and there would have to be a way to specify the encoding. So all software would 
have to be more complex, and there would be a lot more room for error.

If there is only one encoding allowed, then there are really only two options: 

UCS-4: because it handles all of Unicode and is always the same number of 
bytes per code point. A lot more like the old char* days. However, no one wants 
to waste all that disk space, so that leaves:

UTF-8: which is ASCII compatible, handles all of Unicode, and has been almost 
universally adopted in most internet exchange formats (those that are sane 
enough to specify a single encoding :-) )

It is also friendly to older software that uses null-terminated char* and the 
like, so even old code will probably not break, even if it does misinterpret 
the non-ascii bytes. And old software that writes plain ascii will also work 
fine, as ascii IS utf-8.
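
A sketch of the properties claimed above, using Python's built-in codecs. `utf-32-le` stands in for UCS-4 here, and the sample strings are illustrative only:

```python
ascii_text = "sea_surface_temperature"
text = "temp = 10\u00B0"

# Pure-ASCII text produces identical bytes under ASCII and UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")  # fixed 4 bytes per code point, no BOM

# Size on disk: 11 bytes in UTF-8 versus 40 in the fixed-width encoding.
print(len(utf8), len(utf32))

# UTF-8 never emits a zero byte for a non-NUL character, so code that
# treats the data as a null-terminated char* will not truncate it.
# UTF-16 (and UTF-32) are full of zero bytes for mostly-ASCII text.
print(b"\x00" in utf8, b"\x00" in utf16)
```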

All that's a long way of saying:

CF should specify UTF-8 as the only correct encoding for all text: char or 
string. With possibly some extra restrictions to ASCII in some contexts.

If that had already been decided, then sorry for the noise :-)

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599001824
This list forwards relevant notifications from Github.  It is distinct from 
cf-metad...@cgd.ucar.edu, although if you do nothing, a subscription to the 
UCAR list will result in a subscription to this list.
To unsubscribe from this list only, send a message to 
cf-metadata-unsubscribe-requ...@listserv.llnl.gov.


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-13 Thread JimBiardCics
Chris,

Python 3 is not the same as python 2. In Python 2 there were two types — str 
(ASCII) and unicode (by default UTF-8). In Python 3 there is only str, and by 
default it holds UTF-8 unicode (there's lots of subtlety that I'm glossing over 
here, but this is what it boils down to). It bit me recently, so I'm sensitive 
to it.

https://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/
https://docs.python.org/3/howto/unicode.html

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-598958361


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-13 Thread Dave Allured
Hmmm.  Chris, I think you are implying a problem that does not exist.  I do not 
think CF has ever restricted the use of UTF-8 in free text within attributes.  
I suspect there are many UTF-8 attribute examples in the wild, though I do not 
have one up my sleeve right now.  Please correct me if I'm wrong.

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-598956209


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-13 Thread JimBiardCics
@Dave-Allured I approve of your proposal. I think we pretty much have no choice 
but to allow UTF-8 as a baseline to start with, but there clearly are larger 
issues to be resolved. (I say "no choice" because, for example, constraining to 
ASCII in python 3 is a bit complicated.)

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-598914012


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-13 Thread Andrew Barna
@Dave-Allured That sounds OK to me. Whatever does get adopted should probably 
be pretty explicit about what is "not allowed".

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-598912616


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-13 Thread Dave Allured
@DocOtak, thanks for restarting this.  In light of past difficulties, I move to 
split the issue.

I think it would be possible to break out the more difficult parts of this 
topic into new and separate issues.  I suggest that this issue #141 be narrowed 
to only a single essential ingredient:  scalar string-type attributes as an 
alternative for traditional character-type attributes.

Can we agree to move the following to new Github issues, and focus for now only 
on legalizing scalar string-type attributes?

* Array string-type attributes.
* All discussion of character sets, UTF-8 or otherwise.

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-598909876


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-13 Thread Andrew Barna
So now that [string variables have 
landed](https://github.com/cf-convention/cf-conventions/pull/140), I want to 
bring some attention to this issue again. Some updates I've learned about:

* Because strings are allowed in variables, some are assuming that string 
attributes are allowed by CF 1.8, especially because [Appendix 
A](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#attribute-appendix)
 uses the word "string" instead of... text? whatever `nc_get_att_text()` 
returns...
* This is now being used as a "CF-1.8" conforming recommendation from 
[PODAAC](https://podaac.jpl.nasa.gov/) for [pinniped 
data](https://github.com/oceandatainterop/nc-eTAG/blob/master/nc-eTAG_Archival_Template.cdl).
* I personally see use of string arrays as having extreme utility for lists of 
people's names (mostly ACDD attrs), flag definitions, and anything else that is 
currently "white space separated" in the CF standard.
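
A plain-Python sketch of why a list of names does not survive whitespace separation, while an array of strings does. The names are hypothetical and no netCDF library is involved:

```python
# Hypothetical list of creator names, as a string array would hold them.
creators = ["Jane Doe", "John Q. Public"]

# Flattening to one whitespace-separated attribute loses the boundaries.
flattened = " ".join(creators)
recovered = flattened.split()

print(recovered)              # five fragments instead of two names
print(recovered == creators)  # the original list cannot be reconstructed
```

With an array-valued string attribute, each element keeps its internal spaces and no parsing convention is needed.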


-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-598851406


Re: [CF-metadata] [cf-convention/cf-conventions] Updating definition of coordinate variable to account for NUG changes (#174)

2020-03-13 Thread taylor13
Yes, thanks for moderating, David.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/174#issuecomment-598784783


Re: [CF-metadata] [cf-convention/cf-conventions] Updating definition of coordinate variable to account for NUG changes (#174)

2020-03-13 Thread Martin
Thanks David, I believe that would be very helpful. We have agreed to change it 
from a `defect` to an `enhancement`, but I don't appear to have the permission 
needed to effect that change, so please make that switch if you can.


-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/174#issuecomment-598677580


Re: [CF-metadata] [cf-convention/cf-conventions] Updating definition of coordinate variable to account for NUG changes (#174)

2020-03-13 Thread David Hassell
As far as I can tell this issue has no moderator as yet. I would be happy to 
take this on, if everyone else is OK with that. I will try to collate a summary 
of the points raised, sometime (hopefully early) next week.

Thanks, David

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/174#issuecomment-598664118


Re: [CF-metadata] [cf-convention/cf-conventions] Updating definition of coordinate variable to account for NUG changes (#174)

2020-03-13 Thread Klaus Zimmermann
Speaking as someone that has been trying to make sense of very diverse CF files 
with nothing but the CF-Convention in my hand, I have to say the fact that 
dimension coordinates can be identified by name and dimension being the same is 
a good thing.

It is very hard to correctly identify, for example, auxiliary coordinates and 
cell_measures because this status can not be inferred from the variables 
themselves, but only from analyzing all relevant variables. This is possible in 
an ad-hoc fashion, but hard to implement in a parser. It becomes harder when 
"all relevant variables" might be spread over several files or exist only in an 
object storage or similar.

Generally, the convention does a good job of telling people with data how to 
put this into netcdf files. It is far more difficult to work with in the other 
direction.
Keeping dimensional variables easily identifiable is a good step in that 
direction, and so I personally strongly support forbidding `string x(x):`.

In fact, I would like to see CF move in a direction where it becomes easier to 
identify the character of all variables, but that is a discussion for another 
day.
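
The name-equals-dimension rule mentioned above can be sketched as a purely local check. The `variables` mapping is a hypothetical stand-in for whatever a parser reads from a file header, not the API of any particular library:

```python
# Hypothetical file description: variable name -> tuple of dimension names.
variables = {
    "time": ("time",),
    "lat": ("lat",),
    "lon": ("lon",),
    "temp": ("time", "lat", "lon"),
    "area": ("lat", "lon"),
}


def is_coordinate_variable(name, dims):
    """True for a 1-D variable whose single dimension shares its name."""
    return dims == (name,)


# Each variable is classified from its own name and dimensions alone.
coordinate_vars = [
    name for name, dims in variables.items()
    if is_coordinate_variable(name, dims)
]
print(coordinate_vars)
```

By contrast, nothing in the entry for `area` says whether it is a cell measure or ordinary data; that can only be found by inspecting the attributes of the other variables that refer to it.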

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/174#issuecomment-598624346
