Re: [CF Metadata] #159: charset attribute

CF Metadata Tue, 28 Feb 2017 00:02:21 -0800

#159: charset attribute
-----------------------------+------------------------------
  Reporter:  bob.simons      |      Owner:  cf-conventions@…
      Type:  enhancement     |     Status:  new
  Priority:  medium          |  Milestone:
 Component:  cf-conventions  |    Version:
Resolution:                  |   Keywords:
-----------------------------+------------------------------


Comment (by heiko.klein):

 I very much appreciate the clarification of the character set for string
 and char variables, but I would like to modify your approach to harmonize
 with the NUG (NetCDF Users Guide), where this is handled differently.
 UTF-8 has been the default for over 10 years (the change came with
 netCDF-4), and the attribute that specifies the character set is named
 _Encoding rather than 'charset'.

 From: http://www.unidata.ucar.edu/software/netcdf/netcdf-4/reqs_new.html
   * Strings are stored in UTF-8 Unicode.
   * String data is stored without being interpreted by the library, but an
 encoding for Unicode strings may be specified with a separate attribute
 (e.g. "_Encoding"). A global or group attribute could be used to specify
 the encoding of all strings in a file or group.

 I propose therefore the following modification:


    All char and string variables may include a '_Encoding' attribute to
 identify the character set (encoding) used by the variable. The value of
 the attribute must be the "Preferred MIME Name" or "Name" listed at
 http://www.iana.org/assignments/character-sets/character-sets.xhtml .
 Charset names are case-insensitive. The recommended charset names are
 "ISO-8859-15" and "UTF-8". A missing _Encoding attribute defaults to
 UTF-8.
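
 A minimal sketch of how a reader could apply the proposed rule, assuming
 the variable's attributes are already available as a plain dict and the
 data as raw bytes. The function name decode_text and the dict layout are
 hypothetical and not part of any netCDF API. Python's codec registry does
 not cover every IANA name, so unknown names raise LookupError here.

```python
import codecs

DEFAULT_ENCODING = "UTF-8"  # default proposed in the text above


def decode_text(raw: bytes, attributes: dict) -> str:
    """Decode raw char/string bytes using the variable's _Encoding attribute."""
    name = attributes.get("_Encoding", DEFAULT_ENCODING)
    # codecs.lookup is case-insensitive, which matches the case-insensitive
    # charset names; it raises LookupError for names Python does not know.
    codec = codecs.lookup(name)
    return raw.decode(codec.name)


# Attribute present: bytes are read as ISO-8859-15.
station = decode_text(b"Troms\xf8", {"_Encoding": "iso-8859-15"})
# Attribute missing: bytes are read as UTF-8.
station_default = decode_text("Tromsø".encode("utf-8"), {})
```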



 I omit the restriction to 8-bit encodings here since I don't really see
 the point. It is technically possible to use two chars for one UTF-16
 character, but it is not recommended.

 Both UTF-8 and ISO-8859-15 are backwards compatible with 7-bit ASCII
 characters, so I dropped the comment about backward compatibility.
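
 The compatibility claim can be checked with a short sketch: every 7-bit
 byte value decodes to the same text under ASCII, UTF-8 and ISO-8859-15.
 The variable names are illustrative only.

```python
# All 128 byte values that 7-bit ASCII defines.
ascii_bytes = bytes(range(128))

as_ascii = ascii_bytes.decode("ascii")
as_utf8 = ascii_bytes.decode("utf-8")
as_latin9 = ascii_bytes.decode("iso-8859-15")

# True when the three decodings agree on every 7-bit byte.
backward_compatible = as_ascii == as_utf8 == as_latin9
```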

 I use ISO-8859-15 instead of ISO-8859-1 because -15 is the updated (1999)
 version, with the major change of including the € sign.
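
 A small illustration of that difference, using Python's built-in codecs:
 the € sign has a single-byte form in ISO-8859-15 but cannot be encoded in
 ISO-8859-1, where the same byte value means the generic currency sign.

```python
euro = "€"

# ISO-8859-15 stores the euro sign in one byte.
latin9_bytes = euro.encode("iso-8859-15")

# ISO-8859-1 has no euro sign, so encoding fails.
try:
    euro.encode("iso-8859-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

# The same byte read as ISO-8859-1 is the generic currency sign (U+00A4).
same_byte_as_latin1 = latin9_bytes.decode("iso-8859-1")
```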

 I prefer a strict default over ambiguity, and the UTF-8 default aligns
 with the NUG.

--
Ticket URL: <http://cf-trac.llnl.gov/trac/ticket/159#comment:1>
CF Metadata <http://cf-convention.github.io/>