@JimBiardCics wrote:

Actually, I know a LOT more about Python than I do about netcdf, HDF, or CF. 
And I'm afraid you have it a bit confused. This is kind of off-topic, but for 
clarity's sake:

> Python 3 is not the same as python 2.

Very True, and a source of much confusion.

> In Python 2 there were two types — str (ASCII) and unicode (by default UTF-8).

Almost right: there were two types:

`str`: a single byte per character, of unknown encoding -- essentially a 
wrapped char* -- usually ASCII-compatible, often Latin-1, but not if you were 
Japanese, for instance... It was also used as a holder of arbitrary binary 
data: see numpy's `fromstring()` function, or reading a binary file. Much like 
how char* is used in C.

`unicode`: Unicode text -- stored internally in UCS-2 or UCS-4 depending on 
how Python was compiled (I know, really?!?!). It could be encoded / decoded 
in various encodings for IO and interaction with other systems.

> In Python 3 there is only str, and by default it holds UTF-8 unicode

Almost right: the Py3 `str` type is indeed Unicode, but it holds a sequence of 
Unicode code points, which are stored internally in a representation chosen 
per string, depending on its contents (really! a very cool optimization, 
actually: if you have only ASCII text, it will use only one byte per char -- 
https://rushter.com/blog/python-strings-and-memory/ ). But all that is hidden 
from the user. To the user, a `str` is simply a sequence of characters from 
the entire Unicode set.
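You can actually see that per-string optimization (PEP 393, the "flexible 
string representation") with `sys.getsizeof` -- the exact byte counts are a 
CPython implementation detail, but the relative growth is visible:

```python
import sys

# CPython picks the narrowest internal representation that can hold
# every code point in the string: 1, 2, or 4 bytes per character.
ascii_s = "a" * 100            # all code points < 128  -> 1 byte each
bmp_s = "\u4e00" * 100         # CJK, code points < 65536 -> 2 bytes each
astral_s = "\U0001F600" * 100  # emoji, beyond the BMP -> 4 bytes each

print(sys.getsizeof(ascii_s))   # smallest
print(sys.getsizeof(bmp_s))
print(sys.getsizeof(astral_s))  # largest
```

All three strings have `len() == 100`; only the internal storage differs.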
 
(Unicode is particularly weird in that one "code point" is not always one 
character, or "grapheme" -- this accommodates languages with more complex 
systems of combining characters, etc. But I digress...)
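A quick illustration of that code-point/grapheme distinction: the same 
visible character can be one code point or two, and normalization maps one 
form onto the other:

```python
import unicodedata

# 'é' as one code point (U+00E9) vs. two ('e' + combining acute U+0301);
# both render as a single grapheme.
composed = "\u00e9"
decomposed = "e\u0301"

print(len(composed))    # 1
print(len(decomposed))  # 2 code points, one visible character

# NFC normalization composes the two-code-point form into one:
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```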

And there are still two types: Python 3 has the `bytes` type, which is 
actually very similar to the old Python 2 string type, but intended to hold 
arbitrary binary data rather than text. (Text is binary data too, of course, 
so it can still hold that.) In fact, if you encode a string, you get a bytes 
object:

```
In [13]: s                                                                      
Out[13]: 'some text'

In [14]: b = s.encode("ascii")                                                  

In [15]: b                                                                      
Out[15]: b'some text'
```
Note the little 'b' before the quote. In that case, they look almost identical, 
as I encoded in ASCII. But what if I had some non-ASCII text?:

```
In [18]: s = "temp = 10\u00B0"                                                  

In [19]: s                                                                      
Out[19]: 'temp = 10°'

In [20]: b = s.encode("ascii")                                                  
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-20-3930abba6989> in <module>
----> 1 b = s.encode("ascii")

UnicodeEncodeError: 'ascii' codec can't encode character '\xb0' in position 9: 
ordinal not in range(128)
```

oops, can't do that -- the degree symbol is not part of ASCII. But I can do 
utf-8:

```
In [21]: b = s.encode("utf-8")                                                  

In [22]: b                                                                      
Out[22]: b'temp = 10\xc2\xb0'
```
which now displays the byte values, escaping the non-ascii ones. So that bytes 
object is what would get written to a netcdf file, or any other binary file.
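(As an aside, going back to that ASCII failure: `encode()` takes an `errors=` 
handler if you really do need to force text into a smaller encoding, at the 
cost of losing or mangling the unencodable characters:)

```python
s = "temp = 10\u00b0"

# The errors= argument controls what happens to characters
# the target codec can't represent:
print(s.encode("ascii", errors="replace"))           # b'temp = 10?'
print(s.encode("ascii", errors="ignore"))            # b'temp = 10'
print(s.encode("ascii", errors="backslashreplace"))  # b'temp = 10\\xb0'
```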

And Python can just as easily encode that text in any supported encoding, of 
which there are many:

```
In [28]: s.encode("utf-16")                                                     
Out[28]: b'\xff\xfet\x00e\x00m\x00p\x00 \x00=\x00 \x001\x000\x00\xb0\x00'
```
But please don't use that one!
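(One reason not to: UTF-16 comes in two byte orders, which is why the encoded 
data above starts with a byte-order mark -- those `\xff\xfe` bytes. The 
explicit-byte-order codecs make the difference obvious:)

```python
s = "temp = 10\u00b0"

# Same text, two different byte orders:
le = s.encode("utf-16-le")
be = s.encode("utf-16-be")
print(le[:4])  # b't\x00e\x00'
print(be[:4])  # b'\x00t\x00e'

# The plain "utf-16" codec prepends a BOM so readers can tell which
# order was used (its value depends on the platform's native order):
print(s.encode("utf-16")[:2])
```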

So anyway, the relevant point here is that there is NOTHING special about utf-8 
as far as Python is concerned. In fact, Python is well suited to handle pretty 
much any encoding folks choose to use -- but it doesn't help a bit with the 
fundamental problem: you need to know what encoding your data is in, in order 
to use it. And if Python software (like any other) is going to write a netcdf 
file with non-ascii text in it, it needs to know what encoding to use.
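Decoding makes that problem concrete: the same bytes decoded with two 
different encodings give different text, and a wrong-but-permissive codec 
won't even raise an error:

```python
b = "temp = 10\u00b0".encode("utf-8")

# Decoding with the right encoding round-trips cleanly:
print(b.decode("utf-8"))   # 'temp = 10°'

# Decoding with the wrong one silently gives mojibake -- latin-1 maps
# every byte value to *some* character, so no exception is raised:
print(b.decode("latin-1"))  # 'temp = 10Â°'
```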

The other complication that has come up here is that, IIUC, the netCDF4 Python 
library (a wrapper around the C libnetcdf) makes no distinction between the 
netcdf types CHAR and STRING (don't quote me on that), but that's a decision 
of the library authors, not a limitation of Python.

Actually, it does seem to give the user some control:

https://unidata.github.io/netcdf4-python/netCDF4/index.html#netCDF4.chartostring

Note that utf-8 is the default, but you can do whatever you want.
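Conceptually, `chartostring` just joins a char array along its last dimension 
and decodes with whatever encoding you pass. A rough stdlib-only sketch of the 
idea -- NOT the library's actual code, which operates on numpy arrays:

```python
def chartostring_sketch(char_rows, encoding="utf-8"):
    """Join each row of single-byte 'chars' into one decoded string.

    Hypothetical illustration of what a chartostring-style conversion
    does with a 2-D char array; utf-8 is the default, as in netCDF4.
    """
    return [b"".join(row).decode(encoding) for row in char_rows]

rows = [[b"t", b"e", b"m", b"p"],
        [b"1", b"0", b"\xc2", b"\xb0"]]  # utf-8 bytes for '10°'
print(chartostring_sketch(rows))  # ['temp', '10°']
```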

In any case, the Python libraries can be made to work with anything reasonable 
CF decides, even if I have to write the PRs myself :-)

Sorry to be so long-winded, but this IS confusing stuff!


Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599005152