#159: charset attribute
-----------------------------+------------------------------
Reporter: bob.simons | Owner: cf-conventions@…
Type: enhancement | Status: new
Priority: medium | Milestone:
Component: cf-conventions | Version:
Resolution: | Keywords:
-----------------------------+------------------------------
Comment (by heiko.klein):
I very much appreciate the clarification of the character set for string
and char variables, but I would like to modify your approach to harmonize
it with the NUG, where this is handled differently. UTF-8 has been the
default for over 10 years (this change came together with netcdf4), and
the attribute that specifies the character set is named _Encoding rather
than 'charset'.
From: http://www.unidata.ucar.edu/software/netcdf/netcdf-4/reqs_new.html
* Strings are stored in UTF-8 Unicode.
* String data is stored without being interpreted by the library, but an
encoding for Unicode strings may be specified with a separate attribute
(e.g. "_Encoding"). A global or group attribute could be used to specify
the encoding of all strings in a file or group.
I propose therefore the following modification:
All char and string variables may include a '_Encoding' attribute to
identify the character set (encoding) used by the variable. The value of
the attribute must be the "Preferred MIME Name" or "Name" listed at
http://www.iana.org/assignments/character-sets/character-sets.xhtml .
Charset names are case-insensitive. The recommended charset names are
"ISO-8859-15" and "UTF-8". A missing _Encoding attribute defaults to
UTF-8.
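As a rough illustration of how a reader could apply this rule, here is a
minimal Python sketch. It assumes the variable's attributes are already
available as a plain dict and the data as raw bytes; the helper names
resolve_encoding and decode_char_data are hypothetical and not part of any
netCDF library. It also assumes the IANA name is one that Python's codec
registry recognizes, which holds for the two recommended names but not
necessarily for every registered charset.

```python
import codecs

# Proposed default when a variable has no _Encoding attribute.
DEFAULT_ENCODING = "UTF-8"


def resolve_encoding(attrs):
    """Return the Python codec name for a variable's attributes (hypothetical helper)."""
    name = attrs.get("_Encoding", DEFAULT_ENCODING)
    # codecs.lookup matches names case-insensitively and raises
    # LookupError if Python has no codec registered under that name.
    return codecs.lookup(name).name


def decode_char_data(raw, attrs):
    """Decode raw bytes of a char/string variable using its _Encoding."""
    return raw.decode(resolve_encoding(attrs))


# Usage: one variable with an explicit attribute, one without.
with_attr = decode_char_data(b"\xa4", {"_Encoding": "iso-8859-15"})
without_attr = decode_char_data("Tromsø".encode("utf-8"), {})
```

Here with_attr is decoded as ISO-8859-15 despite the lowercase spelling,
and without_attr falls back to UTF-8.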
I have omitted the restriction to 8-bit encodings here, since I don't
really see the point of it. It is technically possible to use 2 chars for
one UTF-16 character, but it is not recommended.
Both UTF-8 and ISO-8859-15 are backwards compatible with 7-bit ASCII
characters, so I dropped the comment about backward compatibility.
I use ISO-8859-15 instead of ISO-8859-1 because -15 is the updated (1999)
version, with the major change of including the € sign.
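Both points can be checked with Python's built-in codecs. This is only a
small sketch: the sample string and variable names are made up for
illustration.

```python
# 7-bit ASCII text is encoded to identical bytes in UTF-8 and ISO-8859-15.
text = "temperature"
ascii_bytes = text.encode("ascii")
same_in_utf8 = text.encode("utf-8") == ascii_bytes
same_in_latin9 = text.encode("iso-8859-15") == ascii_bytes

# The euro sign exists in ISO-8859-15 (as the single byte 0xA4) ...
euro = "\u20ac"
euro_latin9 = euro.encode("iso-8859-15")

# ... but cannot be encoded in ISO-8859-1 at all.
try:
    euro.encode("iso-8859-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False

# The same byte read as ISO-8859-1 is the generic currency sign instead.
byte_as_latin1 = euro_latin9.decode("iso-8859-1")
```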
I prefer a strict default over ambiguity, and the UTF-8 default aligns
with the NUG.
--
Ticket URL: <http://cf-trac.llnl.gov/trac/ticket/159#comment:1>
CF Metadata <http://cf-convention.github.io/>