Hi Charlie,
Great that you have opened the door to this discussion. I fully agree
that "group-awareness" in CF is an area that is crying out to be
explored and solved. Your analysis of the technical details -- e.g.
attribute scope and inheritance by group descendants -- sounds
natural and sensible.
The principal barrier to moving forward along this path lies in the fact
that CF is heavily committed to interoperability. Arguably
interoperability is the raison d'être of CF. The style of "backwards
compatibility" that one gets through a headlong switch from the netCDF3
API into the group-oriented elements of the netCDF4 API is the most
extreme sort of 1-way trap door. It leads to next generation files that
are utterly inaccessible to previous generation applications. This
style of advancement, which heavily degrades interoperability from a
community-wide perspective, should only be undertaken IMHO if all
reasonable alternatives are exhausted. I welcome discussion on this
"philosophical" point.
So are there reasonable alternatives that avoid these negative impacts
on interoperability? It is common practice to flatten groups by
dot-appending the name hierarchy: group.subgroup.child. One could
certainly envision utilities (in the style of your NCO) that could
convert a netCDF4-CF file into a netCDF3-CF file. Could such a
translation layer be made available as a Web service? Not the worst
answer .... A question I'd like to see discussed (directed primarily at
Unidata, I guess): how difficult would it be to make accommodations in
the netCDF API itself, so that netCDF4 groups were accessible through
the netCDF3 API? If such enhancements could be baked into the netCDF
code, the character of the interoperability impacts of adding
group-aware elements to CF would be utterly changed. This would open the
door wide to group-aware CF. Has an analysis of this been done?
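To make the flattening idea concrete, here is a minimal sketch in Python.
It assumes a hypothetical in-memory model in which a group is a plain dict
with "variables" and "groups" entries; it does not call the netCDF library,
and the function name flatten_groups is made up for illustration.

```python
# Hypothetical in-memory model (NOT the netCDF API): a group is a dict with
# "variables" (name -> data) and "groups" (name -> child group).

def flatten_groups(group, prefix=""):
    """Return {dotted_name: data} for every variable at or below `group`."""
    flat = {}
    for name, data in group.get("variables", {}).items():
        flat[prefix + name] = data
    for name, child in group.get("groups", {}).items():
        # Prepend the group name so the hierarchy survives in the flat name.
        flat.update(flatten_groups(child, prefix + name + "."))
    return flat


root = {
    "variables": {"time": [0, 1, 2]},
    "groups": {
        "model": {
            "variables": {"tas": [280.0, 281.5, 279.9]},
            "groups": {
                "diag": {"variables": {"olr": [240.1, 238.7, 239.4]}, "groups": {}},
            },
        },
    },
}

flat = flatten_groups(root)
print(sorted(flat))
```

One caveat with this scheme: a flat variable whose own name already contains
a dot cannot be told apart from a flattened one, so a real converter would
need an escaping rule or a different separator.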
- Steve
==============================================
On 9/15/2013 6:53 PM, Charlie Zender wrote:
NASA has recently convened an Earth Science Data System Working Group to
explore existing conventions for data and products stored in HDF and to
make recommendations for future developments. The CF Conventions are an
important element in this work, as many scientists and users are
interested in data products that comply with CF. Many members of the
working group are familiar with CF and have been involved in attempts to
apply the CF Conventions to a variety of Earth Science data products.
We have identified a persistent barrier to NASA's greater adoption of
CF: the lack of protocols for exploiting software-defined group
hierarchies for data structures. HDF datasets traditionally collected
and stewarded by NASA often utilize hierarchical (the "H" in HDF)
groups. A chief advantage of netCDF4 over netCDF3 is that it supports a
group API compatible with HDF. Here we outline an approach to
incorporating groups into CF as a step towards recognizing and,
eventually, exploiting groups.
Some aspects of CF (especially the netCDF Conventions like _FillValue,
valid_min) can apply unambiguously to HDF files that use groups, but
other aspects of CF conventions have room for ambiguity when applied to
such HDF files. Clarifying that ambiguity is one role of conventions, so
we would like to start a discussion with the aim of obtaining feedback,
gathering consensus, and eventually, possibly, embedding
"group-awareness" into CF. Unidata's white paper on Conventions for
netCDF4
(http://www.unidata.ucar.edu/software/netcdf/papers/nc4_conventions.html) began
the discussion of potential "group-aware" CF capabilities. Some previous
discussion of "group-aware" CF metadata is contained or referenced in
CF-Metadata Trac tickets 79 (Handling and formatting of vector
quantities in CF) and 90 (Collection of CF enhancements for
interoperable applications) yet the "big discussion" on how/whether CF
should exploit the hierarchical group capabilities of netCDF4 is
unfinished. Below we propose a standard scheme for interpreting metadata
scope in hierarchical (group) files, and suggest one or two new Group
Attributes which we could turn into concrete proposals if interest warrants.
Perhaps the most obvious place to start a discussion on making CF
"group-aware" is the notion of attribute scope: How ought metadata in
one group apply, if at all, to other groups? CF metadata attributes may
be applied at the group level (netCDF4 allows this) yet what should that
mean? Whereas the current CF Convention speaks only of Global Attributes
and Variable Attributes, a "group-aware" CF must explicitly define the
properties of a third category of attributes, Group Attributes. Global
Attributes are a special case of Group Attributes and should share their
properties.
The key technical definition we propose is that Group Attributes shall
apply to the group where they are defined and to its descendants, but
not to that group's ancestors or siblings. Group Attributes apply to all
a group's descendants recursively with an exception: Any group may
redefine an attribute defined in an ancestor group, and that
child-group's definition applies to all its descendants. Thus in cases
where multiple ancestor groups define the same attribute, attribute
values are inherited from the nearest ancestor. Note that these are the
same scoping properties as netCDF4 dimensions.
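The scoping rule proposed above can be sketched as a lookup that walks from
a group towards the root and stops at the first definition it finds. This is
only an illustration under stated assumptions: the Group class and its
resolve method are hypothetical stand-ins, not part of any netCDF or HDF
library, and the attribute values are invented.

```python
class Group:
    """Hypothetical stand-in for a netCDF4/HDF5 group (not a library class)."""

    def __init__(self, name, parent=None, attrs=None):
        self.name = name
        self.parent = parent
        self.attrs = dict(attrs or {})

    def resolve(self, attr_name):
        """Return the value from this group or its nearest ancestor, else None."""
        group = self
        while group is not None:
            if attr_name in group.attrs:
                return group.attrs[attr_name]
            group = group.parent  # move one level closer to the root
        return None


root = Group("/", attrs={"institution": "NASA", "source": "unspecified"})
sat = Group("satellite", parent=root, attrs={"source": "CERES"})
ch1 = Group("channel1", parent=sat)
model = Group("model", parent=root)

print(ch1.resolve("source"))       # inherited from the nearest ancestor, "satellite"
print(ch1.resolve("institution"))  # inherited from the root (a Global Attribute)
print(model.resolve("source"))     # a sibling's redefinition does not apply here
```

Because the walk only ever moves upwards, an attribute never leaks to
ancestors or siblings, and Global Attributes fall out as the special case of
attributes defined on the root group.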
Our understanding is that this proposal is consistent and
backwards-compatible with CF. However, it would extend the current usage
of CF to files with arbitrary hierarchies of groups. Moreover, it might
be helpful to specifically disallow (or mark as having undefined
consequences) the use of Group Attributes to store metadata that should
always be attached directly to variables. Group Attributes such as
_FillValue, scale_factor, and valid_min might sometimes seem tempting yet
might create more problems than they would solve. Some attributes (e.g.,
Conventions) may be useful only as Global Attributes, and not as Group
Attributes for other groups.
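If such a restriction were adopted, a checker could flag offending Group
Attributes along the following lines. This is a sketch only: the set of
variable-only names is taken from the examples in this message and is an
assumption rather than a CF rule, and the dict-based group model and the
function name find_misplaced are hypothetical.

```python
# Names from the examples above; the set is an assumption, not a CF rule.
VARIABLE_ONLY = {"_FillValue", "scale_factor", "valid_min"}


def find_misplaced(group, path="/"):
    """Return (group_path, attr_name) pairs for variable-only attributes on groups."""
    found = [
        (path, name)
        for name in sorted(group.get("attrs", {}))
        if name in VARIABLE_ONLY
    ]
    for name, child in sorted(group.get("groups", {}).items()):
        found.extend(find_misplaced(child, path + name + "/"))
    return found


tree = {
    "attrs": {"Conventions": "CF-1.6"},
    "groups": {
        "obs": {"attrs": {"_FillValue": -999.0, "title": "retrievals"}, "groups": {}},
        "sim": {"attrs": {"scale_factor": 0.01}, "groups": {}},
    },
}

print(find_misplaced(tree))
```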
What would a "group-aware" CF Convention mean in practice? It is
important to preserve CF backwards compatibility. The metadata
annotation of flat files (e.g., all netCDF3 files) need not be affected
by any "group-aware" CF Convention extensions.
Files with group hierarchies would continue to have Global Attributes
(i.e., Group Attributes at the root group level). Global Attributes are
almost always useful because they apply to the entire file except where
superseded by an attribute of the same name at a lower level. Where
group-oriented attribute conventions would help, we believe, is in
extending the power of CF unambiguously to nested groups.
Imagine a group file in which each top-level group holds model results
from a distinct CMIP5 simulation (CCSM, ECMWF, GISS, etc.), or in which
each top-level group holds a different satellite-retrieved value of the
same field (ERBE OLR, CERES OLR, etc.) or a different channel from the
same multi-spectral radiometer. It may be helpful to know the relation
of groups to other groups, so that users and tools can learn which are
(or aren't) intercomparable or aggregable. Analysis tools (such as NCO)
would benefit from discovering certain properties of ensembles stored as
groups automatically, including: Which groups contain the other
ensemble realizations? Which groups hold other channels of a
multi-spectral instrument? Knowing this information would help users and
analysis tools infer how best to create ensemble statistics, and could
significantly reduce the overall number of files confronting users.
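As a purely illustrative sketch of how a tool might answer "which groups
contain the other ensemble realizations?", suppose each top-level group
carried a Group Attribute, here called "ensemble", whose value is a shared
label. Both the attribute name and its semantics are assumptions made for
this example, as are the label values and the function name
ensemble_members; nothing here is defined by CF.

```python
from collections import defaultdict


def ensemble_members(top_level_groups):
    """Map each ensemble label to the sorted names of the groups carrying it."""
    members = defaultdict(list)
    for name, attrs in top_level_groups.items():
        label = attrs.get("ensemble")  # hypothetical attribute name
        if label is not None:
            members[label].append(name)
    return {label: sorted(names) for label, names in members.items()}


# Group name -> that group's attributes (invented values for illustration).
groups = {
    "CCSM": {"ensemble": "cmip5_tas"},
    "GISS": {"ensemble": "cmip5_tas"},
    "ERBE": {"ensemble": "olr_retrievals"},
    "CERES": {"ensemble": "olr_retrievals"},
    "notes": {},  # no label, so it belongs to no ensemble
}

print(ensemble_members(groups))
```

A tool could then compute ensemble statistics over the groups that share a
label, without the user having to list them by hand.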
Finally, groups allow containerization of information which can be
useful in avoiding repetition. Some would like to define metadata-only
groups that could then be logically attached to apply to some or all
other groups in a file. Is it desirable for CF to define a standard way
to indicate this?
As the previous examples illustrate, there are at least two levels to a
discussion about "group-aware" CF. The first is scope, i.e., how
attribute meanings are inherited in hierarchies. The second is the more
pragmatic issue of what new CF attributes would allow us to exploit
group hierarchies in a systematic way. We proposed an answer to the
scope issue to kickstart the discussion. We illustrated how a new
attribute (call it "ensemble" for now) might be useful. At this stage we
wish to learn whether CF users/developers are interested in pursuing
"group-aware" CF extensions at all before we develop more
details/wording for specific conventions. Perhaps there are others
working on similar issues, or perhaps the CF maintainers prefer to
receive specific wording of proposals rather than more diffuse
"invitations to discuss" like this. If you have an opinion, then please
let us know.
Until the CF (or some other) Convention tackles the issues of scoping
and Group Attributes, such annotations will be ad hoc. Our goal is to
increase interoperability, and we are eager to hear responses from the
CF community on the direction of "group-aware" extensions to CF.
On behalf of the NASA ESDS HDF5 WG,
Charlie Zender, Ted Habermann, and Peter Leonard
_______________________________________________
CF-metadata mailing list
http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata