Hi Charlie,
very good and extensive explanation of the potential use for groups and
group-aware metadata. Yet, I have a few remarks (which may in part reveal that
I should probably read the preamble of the CF convention again ;-):
> Point 1: How does the user know she has all the realizations?
Is this question best addressed with metadata in a (series of) file(s)? In a
modern, interoperable architecture, I would think that this belongs in the
realm of data discovery, which would be done via web catalogues using metadata
facets. File-based metadata IMHO may be more prone to failure. Just imagine that
ECMWF had first generated two ensemble members, and their metadata said so
(your "page 2/2" analogy). Now they run another two: do you really expect the
metadata from the old files to be updated? A web catalogue would provide a more
robust solution to this question, I believe.
This doesn't mean that it may not be useful to have such information in a file!
However, to come back to the suitcases: this can only be a packing list for the
current trip and not an inventory of all the socks you may possibly own. Of
course your young aspiring researcher may wish to express her knowledge about
other ensemble members she found on the web but didn't include in the file (the
suitcase). But will her supervisor or colleague on the other side of the world
understand what she is talking about? I think, if you intend to go beyond the
packing list, you open too many cans with too many worms.
> Point 2: Multiple and/or Non-numeric Ensemble axes
Here, you have a valid point, although - again - I would not connect this to
knowing "that she has all the models". Yet, within the packed file (the
suitcase) you want to know which hierarchy model (packing order) was applied in
order to be able to aggregate things (for example by computing ensemble
averages). See also my use-case on aircraft data introduced below. Question:
what happens to this kind of information when the files are flattened and
re-packaged? It might well become meaningless, which would indicate that these
are "temporary" metadata, and thus probably out of scope for CF. This actually
reminds me a bit of my experiences with the history attribute when I use ncks
-A. This command will preserve the history of one file, but discard the history
of the other file, which is certainly not the behavior you would like to see in
ungrouping/re-grouping software.
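
To make the history concern a bit more concrete, here is a minimal Python
sketch of what a re-grouping tool could do instead of keeping only one history.
It works on plain strings rather than on real netCDF files; the function name
merge_histories, the file labels and the "[label]" prefix are purely my own
assumptions for illustration, not something ncks or CF defines.

```python
def merge_histories(histories):
    """Combine per-file history strings without discarding any of them.

    `histories` maps a (hypothetical) source label to that file's history
    text. Every non-empty line is prefixed with its source label so that its
    origin stays traceable after the files are merged.
    """
    merged = []
    for label, text in histories.items():
        for line in text.splitlines():
            if line.strip():
                merged.append(f"[{label}] {line.strip()}")
    return "\n".join(merged)


# Hypothetical histories of two files that are packed into one.
example = merge_histories({
    "member_a.nc": "created\nregridded",
    "member_b.nc": "created",
})
```

Whether such a concatenated history is still meaningful after another round of
flattening is exactly the open question.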
> Point 3: Weights and intentional reproducibility of MME statistics
In my view this is actually just another viewing angle on your point #2.
--
Your use-case does however highlight the "convenience" of grouping data which
somehow belong to each other into one file. In a world of flat files, you must
check the coordinates each time you want to perform some sort of (ensemble)
averaging operation. A hierarchical file will tell you that it is OK to average
by placing the common coordinates on the upper level. IMPORTANT: again, this
doesn't mean that this is the only or best way to do the grouping - yet, it
seems a compelling advantage to have this coordinate-consistency problem
eliminated somewhere along your processing steps. As others have said already:
there are reasons why people use suitcases.
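
The coordinate check that flat files force on you can be sketched in a few
lines of Python. This is only an illustration with plain lists; the dictionary
keys "coords" and "values" and the function name ensemble_mean are hypothetical
and do not correspond to any CF or netCDF API.

```python
def ensemble_mean(members):
    """Average members point by point, but only if their coordinates agree.

    `members` is a list of dicts with the hypothetical keys "coords" (a list
    of coordinate values) and "values" (the data on those coordinates).
    """
    reference = members[0]["coords"]
    for member in members[1:]:
        if member["coords"] != reference:
            raise ValueError("members do not share the same coordinates")
    n = len(members)
    return [sum(m["values"][i] for m in members) / n
            for i in range(len(reference))]


# Two flat "files" that happen to share their coordinates.
flat_members = [
    {"coords": [0.0, 1.0, 2.0], "values": [1.0, 2.0, 3.0]},
    {"coords": [0.0, 1.0, 2.0], "values": [3.0, 4.0, 5.0]},
]
mean = ensemble_mean(flat_members)
```

In a hierarchical file with the common coordinates on the upper level, the loop
that compares the coordinates would not be needed, because every member refers
to the same coordinate variable by construction.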
--
Now, here is another use case, which we haven't implemented yet - partly
because we didn't see how it can be done in a CF consistent way:
While there has been a definition of a standard file layout for data from
multiple stations (a contribution from Ben Domenico and Stefano Nativi if I am
not mistaken), this concept cannot be applied to multiple aircraft flight data.
The station data can be packaged together with help of a non-geophysical
"station" coordinate, because all stations share the same time axis. With
aircraft flights, the time axes often don't overlap, and forcing all data onto
the superset of time would be a tremendous waste of space. Groups would seem to be
the natural solution to this problem! Why not flat files? Because you might
wish to retrieve all the aircraft data which were sampled in a given region
during a specific period (a natural use case for a catalogue query it seems) in
one entity, and not in N entities, where you cannot even predict N.
I would think the same applies to "granules" of satellite data which share a
common calibration, for example.
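
To put a rough number on the waste of space, here is a small Python sketch
that compares the two layouts for a single variable. The flight lengths are
made up, and the assumption that no two flights share a time stamp is mine (the
worst case for the superset axis); the function name storage_cells is
hypothetical as well.

```python
def storage_cells(flight_lengths):
    """Return (padded, grouped) cell counts for one variable.

    Assumes, for illustration only, that no two flights share a time stamp,
    so the superset time axis has sum(flight_lengths) entries.
    """
    n_flights = len(flight_lengths)
    superset = sum(flight_lengths)
    # One (flight, time) array on the superset axis, mostly fill values.
    padded = n_flights * superset
    # One group per flight, each with its own time axis.
    grouped = sum(flight_lengths)
    return padded, grouped


# Three hypothetical flights with 1000, 1200 and 800 samples.
padded, grouped = storage_cells([1000, 1200, 800])
```

Under these assumptions the padded layout needs N times as many cells as the
grouped one for N flights, so the overhead grows with every flight that the
query returns.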
--
As Nan said, we should come back to defining what is really at stake for
CF and what exactly should be proposed. Now this is where my failure to re-read
the convention preamble may show ;-). The main question is: is CF about files
or about interoperability? Unfortunately, my view on this is not entirely
clear, because it seems to be a bit of both. The standard_names clearly have a
bearing in the interoperable world, and this shows through various links to the
CF standard_names in web catalogues or controlled vocabulary collections (e.g.
SeaDataNet). The conventions themselves seem to be more file-oriented - even
though the discussions about the data model always make a strong point to go
beyond representation in a (single) file. [If someone disagrees and wishes to
see the CF convention play a more important role in interoperability, then I
would ask why it is not cast into an XML schema extending ISO 19115.] If
CF is indeed "file-oriented", then I do think that it makes a lot of sense to
support "modern" file structures, which include groups and hierarchies, whether
we like them or not. Therefore, I would advocate that we focus the discussion
on two major points with a couple of sub-issues:
1. which parts of CF might fail when we have a hierarchical file? (and let's
stick to the simple inheritance model of netCDF-4 for now!)
1a. what would the current CF checker say if it is fed a hierarchical file?
1b. what happens to global attributes when flat files are grouped together?
1c. do we need to re-phrase some aspects of the convention to make them
"group-aware"? (this does not include defining new rules - that is covered in
point 2)
1d. anything else?
2. where do we need to extend the current CF concept?
2a. introduction of a new attribute "level" (equate "global" with "root"? What
happens when hierarchical files are flattened? [please see the 3 varieties of
flattening operations mentioned in an earlier post])
2b. specification of "ensemble_..." attributes? "ensemble_axis" may not be
needed if these axes are defined on the group level (?). Something like
"ensemble_history" or "ensemble_structure" to inform the user about the
grouping principle?
2c. what other "relations" need to be expressed within a hierarchical file? The
guiding principle here should be that additional rules are only needed if they
avoid ambiguity and misinterpretation of the data. And here we get onto
interoperability territory again (see my use case about aircraft data above).
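
For question 1b, one conceivable answer (just a sketch to make the discussion
concrete, not a proposal for the convention) is to promote a global attribute
to the root only if all flat files agree on it, and to keep everything else on
the group. In Python, with plain dictionaries instead of real files, and with
the group names, attribute values and the function name group_flat_files all
being my own assumptions:

```python
def group_flat_files(files):
    """Split global attributes into root-level and per-group attributes.

    `files` maps a hypothetical group name to the global attributes of the
    flat file that becomes this group. An attribute goes to the root only if
    every file carries it with the same value. Expects at least one file.
    """
    attr_sets = list(files.values())
    root = {key: value for key, value in attr_sets[0].items()
            if all(key in attrs and attrs[key] == value
                   for attrs in attr_sets[1:])}
    groups = {name: {k: v for k, v in attrs.items() if k not in root}
              for name, attrs in files.items()}
    return root, groups


# Two hypothetical flat files that differ only in their history.
root, groups = group_flat_files({
    "member_1": {"Conventions": "CF-1.6", "history": "run 1"},
    "member_2": {"Conventions": "CF-1.6", "history": "run 2"},
})
```

Flattening would then have to copy the root attributes back into each file,
which is one of the places where the rules would need to be spelled out.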
Sorry for this long post -- this just somehow seems to be quite relevant!
Best regards,
Martin
--------------------------------------------------------------------------------
PD Dr. Martin G. Schultz
IEK-8, Forschungszentrum Jülich
D-52425 Jülich
Ph: +49 2461 61 2831
_______________________________________________
CF-metadata mailing list
[email protected]
http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata