Hi.

Attributes with a type of string are now possible with netCDF-4, and many examples of attributes with this type are "in the wild". As an example of how this is happening, IDL creates an attribute with this type if you select its version of **`string`** type instead of **`char`** type. It seems that people often assume that **`string`** is the correct type to use because they wish to store strings, not characters.

I propose adding language to the Conventions to allow attributes that have a type of **`string`**. Allowing attributes of this type has two ramifications, the second of which affects string variables as well.

1. A **`string`** attribute can contain 1D atomic string arrays. We need to decide whether or not we want to allow these or limit them (at least for now) to atomic string scalars. Attributes with arrays of strings could allow for cleaner delimiting of multiple parts than spaces or commas do now (e.g. flag_values and flag_meanings could both be arrays), but this would be a significant stretch for current software packages.
2. A **`string`** attribute (and a **`string`** variable) can contain UTF-8 Unicode strings. UTF-8 uses variable-length characters, with the standard ASCII characters as the 1-byte subset. According to the Unicode standard, a UTF-8 string can be signaled by the presence of a special non-printing three byte sequence known as a Byte Order Mark (BOM) at the front of the string, although this is not required. IDL (again, for example) writes this BOM sequence at the beginning of every attribute or variable element of type **`string`**.
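To make the BOM point concrete, here is a minimal Python sketch of how reading software might tolerate a leading BOM. It assumes the raw attribute bytes are already in hand; the helper name `strip_utf8_bom` and the sample value are made up for illustration and are not part of any netCDF API.

```python
import codecs


def strip_utf8_bom(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    # The "utf-8-sig" codec removes one leading EF BB BF sequence and
    # otherwise behaves like plain "utf-8".
    return raw.decode("utf-8-sig")


# Hypothetical attribute values: one written with a BOM, one without.
with_bom = codecs.BOM_UTF8 + "degrees_north".encode("utf-8")
without_bom = "degrees_north".encode("utf-8")

# A plain UTF-8 decode keeps the BOM as an invisible U+FEFF character,
# so a naive string comparison against "degrees_north" would fail.
print(with_bom.decode("utf-8") == "degrees_north")
print(strip_utf8_bom(with_bom) == strip_utf8_bom(without_bom))
```

Software that compares attribute values against known terms would need to do something like this (or reject the BOM outright) before comparing.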

Allowing attributes containing arrays of strings may open up useful future directions, but it will be more of a break from the past than attributes that have only single strings. Allowing attributes (and variables) to contain UTF-8 will free people to store non-English content, but it might pose headaches for software written in older languages such as C and FORTRAN.
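One source of those headaches is that a UTF-8 string's byte length can exceed its character count, which matters for code that sizes fixed-width buffers by character. The Python sketch below illustrates the mismatch; the function name `fits_in_buffer` and the 11-byte buffer size are assumptions chosen only for the example.

```python
def fits_in_buffer(text: str, buffer_bytes: int) -> bool:
    """Return True if the UTF-8 encoding of text fits in buffer_bytes bytes."""
    return len(text.encode("utf-8")) <= buffer_bytes


ascii_name = "temperature"
accented_name = "temp\u00e9rature"  # U+00E9 (e with acute) is 2 bytes in UTF-8

# Both strings have the same number of characters...
print(len(ascii_name), len(accented_name))

# ...but their encoded sizes differ by one byte.
print(len(ascii_name.encode("utf-8")), len(accented_name.encode("utf-8")))

# A buffer sized for 11 one-byte characters holds the first but not the second.
print(fits_in_buffer(ascii_name, 11), fits_in_buffer(accented_name, 11))
```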

To finalize the change to support **`string`** type attributes, we need to decide:

1. Do we explicitly forbid string array attributes?
2. Do we place any restrictions on the content of **`string`** attributes and (by extension) variables?

Now that I have the background out of the way, here's my proposal.

Allow **`string`** attributes. Specify that the attributes defined by the current CF Conventions must be scalar (contain only one string).

Allow UTF-8 in attribute and variable values. Specify that the current CF Conventions use only ASCII characters (which are a subset of UTF-8) for all terms they define. That is, the controlled vocabulary of CF (standard names and extensions, cell_methods terms other than free-text elements of comments(?), area type names, time units, etc.) is composed entirely of ASCII characters. Free-text elements (comments, long names, flag_meanings, etc.) may use any UTF-8 character.
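As a rough sketch of how a compliance checker might apply the two rules proposed above (scalar values for CF-defined attributes, ASCII for controlled vocabulary), consider the Python below. The attribute sets are small illustrative subsets chosen for the example, not an authoritative list, and `check_string_attribute` is a hypothetical helper rather than part of any existing tool.

```python
from typing import List, Sequence, Union

# Illustrative subsets only (assumptions, not the full CF lists).
CONTROLLED_ATTRIBUTES = {"standard_name", "units"}
FREE_TEXT_ATTRIBUTES = {"long_name", "comment"}
CF_DEFINED_ATTRIBUTES = CONTROLLED_ATTRIBUTES | FREE_TEXT_ATTRIBUTES

AttrValue = Union[str, Sequence[str]]


def check_string_attribute(name: str, value: AttrValue) -> List[str]:
    """Return a list of problems found for one string-typed attribute."""
    problems: List[str] = []
    if name not in CF_DEFINED_ATTRIBUTES:
        # Attributes outside CF are not restricted by this sketch.
        return problems
    if not isinstance(value, str):
        # CF-defined attributes must hold exactly one string.
        problems.append(f"{name}: expected a scalar string, got an array")
        return problems
    if name in CONTROLLED_ATTRIBUTES and not value.isascii():
        # Controlled-vocabulary terms are ASCII only.
        problems.append(f"{name}: controlled vocabulary must be ASCII")
    return problems


print(check_string_attribute("standard_name", "air_temperature"))
print(check_string_attribute("long_name", "Temp\u00e9rature de l'air"))
print(check_string_attribute("units", "\u00b0C"))
print(check_string_attribute("comment", ["part one", "part two"]))
```

Free-text attributes pass with any UTF-8 content, while the non-ASCII units value and the array-valued comment are each reported as a problem.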

Github issue: #141 <https://github.com/cf-convention/cf-conventions/issues/141>
Trac ticket: #176 <https://cf-trac.llnl.gov/trac/ticket/176#ticket>

Grace and peace,

Jim

--
CICS-NC <http://www.cicsnc.org/> Visit us on Facebook <http://www.facebook.com/cicsnc>
*Jim Biard*
*Research Scholar*
Cooperative Institute for Climate and Satellites NC <http://cicsnc.org/>
North Carolina State University <http://ncsu.edu/>
NOAA National Centers for Environmental Information <http://ncdc.noaa.gov/>
/formerly NOAA’s National Climatic Data Center/
151 Patton Ave, Asheville, NC 28801
e: [email protected] <mailto:[email protected]>
o: +1 828 271 4900

/Connect with us on Facebook for climate <https://www.facebook.com/NOAANCEIclimate> and ocean and geophysics <https://www.facebook.com/NOAANCEIoceangeo> information, and follow us on Twitter at @NOAANCEIclimate <https://twitter.com/NOAANCEIclimate> and @NOAANCEIocngeo <https://twitter.com/NOAANCEIocngeo>. /


_______________________________________________
CF-metadata mailing list
[email protected]
http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata
