NASA has recently convened an Earth Science Data System Working Group to explore existing conventions for data and products stored in HDF and to make recommendations for future developments. The CF Conventions are an important element in this work, as many scientists and users are interested in data products that comply with CF. Many members of the working group are familiar with CF and have been involved in attempts to apply the CF Conventions to a variety of Earth Science data products.
We have identified a persistent barrier to NASA's greater adoption of CF: the lack of protocols for exploiting software-defined group hierarchies for data structures. HDF datasets traditionally collected and stewarded by NASA often utilize hierarchical (the "H" in HDF) groups. A chief advantage of netCDF4 over netCDF3 is that it supports a group API compatible with HDF. Here we outline an approach to incorporating groups into CF as a step towards recognizing and, eventually, exploiting groups.

Some aspects of CF (especially the netCDF Conventions like _FillValue, valid_min) can apply unambiguously to HDF files that use groups, but other aspects of CF conventions have room for ambiguity when applied to such HDF files. Clarifying that ambiguity is one role of conventions, so we would like to start a discussion with the aim of obtaining feedback, gathering consensus, and eventually, possibly, embedding "group-awareness" into CF.

Unidata's white paper on Conventions for netCDF4 (http://www.unidata.ucar.edu/software/netcdf/papers/nc4_conventions.html) began the discussion of potential "group-aware" CF capabilities. Some previous discussion of "group-aware" CF metadata is contained or referenced in CF-Metadata Trac tickets 79 (Handling and formatting of vector quantities in CF) and 90 (Collection of CF enhancements for interoperable applications), yet the "big discussion" on how/whether CF should exploit the hierarchical group capabilities of netCDF4 is unfinished. Below we propose a standard scheme for interpreting metadata scope in hierarchical (group) files, and suggest one or two new Group Attributes which we could turn into concrete proposals if interest warrants.

Perhaps the most obvious place to start a discussion on making CF "group-aware" is the notion of attribute scope: How ought metadata in one group apply, if at all, to other groups? CF metadata attributes may be applied at the group level (netCDF4 allows this), yet what should that mean?
Whereas the current CF Convention speaks only of Global Attributes and Variable Attributes, a "group-aware" CF must explicitly define the properties of a third category of attributes, Group Attributes. Global Attributes are a special case of Group Attributes and should share their properties. The key technical definition we propose is that Group Attributes shall apply to the group where they are defined and to its descendants, but not to that group's ancestors or siblings. Group Attributes apply to all of a group's descendants recursively, with one exception: Any group may redefine an attribute defined in an ancestor group, and that child group's definition then applies to all of its own descendants. Thus in cases where multiple ancestor groups define the same attribute, attribute values are inherited from the nearest ancestor. Note that these are the same scoping properties as netCDF4 dimensions.

Our understanding is that this proposal is consistent and backwards-compatible with CF. However, it would extend the current usage of CF to files with arbitrary hierarchies of groups. Moreover, it might be helpful to specifically disallow (or mark as having undefined consequences) the use of Group Attributes to store metadata that should always be attached directly to variables. Group Attributes such as _FillValue, scale_factor, and valid_min might sometimes seem tempting, yet might create more problems than they would solve. Some attributes (e.g., Conventions) may be useful only as Global Attributes, and not as Group Attributes for other groups.

What would a "group-aware" CF Convention mean in practice? It is important to preserve CF backwards compatibility. The metadata annotation of flat files (e.g., all netCDF3 files) need not be affected by any "group-aware" CF Convention extensions. Files with group hierarchies would continue to have Global Attributes (i.e., Group Attributes at the root group level).
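The scoping rule proposed above (an attribute applies to its own group and to descendants, with the nearest ancestor winning) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Group class, resolve_attribute, misplaced_group_attributes, and the VARIABLE_ONLY list are hypothetical names invented here, not part of CF, netCDF4, or any HDF library, and the attribute values are made up.

```python
# Illustrative list of attributes the proposal suggests keeping on variables only.
VARIABLE_ONLY = {"_FillValue", "scale_factor", "valid_min"}


class Group:
    """Hypothetical stand-in for a netCDF4/HDF5 group; not a real library class."""

    def __init__(self, name, attrs=None, parent=None):
        self.name = name
        self.attrs = dict(attrs or {})
        self.parent = parent

    def child(self, name, **attrs):
        """Create a subgroup whose parent is this group."""
        return Group(name, attrs, parent=self)


def resolve_attribute(group, key, default=None):
    """Look in the group itself, then in each ancestor up to the root.

    The first definition found wins, so a redefinition in a nearer
    ancestor hides the value set further up the hierarchy.
    """
    node = group
    while node is not None:
        if key in node.attrs:
            return node.attrs[key]
        node = node.parent
    return default


def misplaced_group_attributes(group):
    """Names of variable-only attributes found directly on a group."""
    return sorted(VARIABLE_ONLY & set(group.attrs))


# Made-up hierarchy: / -> model_a -> run1, and / -> model_b
root = Group("/", {"institution": "Example Institute", "source": "root default"})
model_a = root.child("model_a", source="model A")
run1 = model_a.child("run1")
model_b = root.child("model_b", scale_factor=0.1)

print(resolve_attribute(run1, "source"))       # model A (nearest ancestor)
print(resolve_attribute(run1, "institution"))  # Example Institute (from root)
print(resolve_attribute(model_b, "source"))    # root default (sibling ignored)
print(misplaced_group_attributes(model_b))     # ['scale_factor']
```

Note that the lookup never descends into children or visits siblings, which is what keeps a redefinition in model_a from leaking into model_b or back up to the root.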
Global Attributes are almost always useful because they apply to the entire file except where superseded by an attribute of the same name at a lower level. Where group-oriented attribute conventions would help, we believe, is in extending the power of CF unambiguously to nested groups.

Imagine a file in which each top-level group holds model results from a distinct CMIP5 simulation (CCSM, ECMWF, GISS, etc.), or in which each top-level group holds a different satellite-retrieved value of the same field (ERBE OLR, CERES OLR, etc.) or a different channel from the same multi-spectral radiometer. It may be helpful to know the relation of groups to other groups, so that users and tools can learn which are (or aren't) intercomparable or aggregable. Properties of ensembles stored as groups that analysis tools (such as NCO) would benefit from discovering automatically include: Which groups contain the other ensemble realizations? Which groups hold other channels of a multi-spectral instrument? Knowing this information would help users and analysis tools infer how best to create ensemble statistics, and could significantly reduce the overall number of files confronting users.

Finally, groups allow containerization of information, which can be useful in avoiding repetition. Some would like to define metadata-only groups that could then be logically attached to some or all other groups in a file. Is it desirable for CF to define a standard way to indicate this?

As the previous examples illustrate, there are at least two levels to a discussion about "group-aware" CF. The first is scope, i.e., how attribute meanings are inherited in hierarchies. The second is the more pragmatic issue of what new CF attributes would allow us to exploit group hierarchies in a systematic way. We proposed an answer to the scope issue to kickstart the discussion. We illustrated how a new attribute (call it "ensemble" for now) might be useful.
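To make the "ensemble" idea concrete, here is a rough Python sketch of how an analysis tool might discover which groups belong together. Everything in it is an assumption for illustration: the semantics of "ensemble" are not defined by CF, so the sketch simply treats its value as a label shared by member groups, and the nested-dict file layout, the find_ensembles function, and the group names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical file layout: every group is a dict holding its own
# attributes ("attrs") and its subgroups ("groups").
tree = {
    "attrs": {"title": "made-up OLR collection"},
    "groups": {
        "ERBE": {"attrs": {"ensemble": "OLR"}, "groups": {}},
        "CERES": {"attrs": {"ensemble": "OLR"}, "groups": {}},
        "ancillary": {"attrs": {}, "groups": {}},
    },
}


def find_ensembles(root):
    """Map each "ensemble" label to the paths of the groups that define it.

    Assumption: only groups that carry the attribute directly count as
    members; how inheritance should interact with it is an open question.
    """
    found = defaultdict(list)

    def walk(node, path):
        label = node["attrs"].get("ensemble")
        if label is not None:
            found[label].append(path or "/")
        for name, child in node["groups"].items():
            walk(child, f"{path}/{name}")

    walk(root, "")
    return dict(found)


print(find_ensembles(tree))  # {'OLR': ['/ERBE', '/CERES']}
```

A tool could then iterate over each list of paths to compute ensemble statistics across the member groups, without the user naming them by hand.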
At this stage we wish to learn whether CF users/developers are interested in pursuing "group-aware" CF extensions at all before we develop more details/wording for specific conventions. Perhaps there are others working on similar issues, or perhaps the CF maintainers prefer to receive specific wording of proposals rather than more diffuse "invitations to discuss" like this. If you have an opinion, then please let us know.

Until the CF (or some other) Convention tackles the issues of scoping and Group Attributes, such annotations will be ad hoc. Our goal is to increase interoperability, and we are eager to hear responses from the CF community on the direction of "group-aware" extensions to CF.

On behalf of the NASA ESDS HDF5 WG,
Charlie Zender, Ted Habermann, and Peter Leonard

--
Charlie Zender, Earth System Sci. & Computer Sci.
University of California, Irvine
