Hi Charlie,

That was a very good and extensive explanation of the potential uses of groups
and group-aware metadata. Still, I have a few remarks (which may in part reveal
that I should probably read the preamble of the CF convention again ;-):

> Point 1: How does the user know she has all the realizations?

Is this question best addressed with metadata in a (series of) file(s)? In a
modern, interoperable architecture, I would think that this belongs in the
realm of data discovery, which would be done via web catalogues using metadata
facets. File-based metadata IMHO may be more prone to failure. Just imagine
ECMWF had first generated two ensemble members, and their metadata said so
(your "page 2/2" analogy). Now they run another two: do you really expect the
metadata in the old files to be updated? A web catalogue would provide a more
robust solution to this question, I believe.

This doesn't mean that it may not be useful to have such information in a file! 
However, to come back to the suitcases: this can only be a packing list for the 
current trip and not an inventory of all the socks you may possibly own. Of 
course your young aspiring researcher may wish to express her knowledge about 
other ensemble members she found on the web but didn't include in the file (the 
suitcase). But will her supervisor or colleague on the other side of the world 
understand what she is talking about? I think, if you intend to go beyond the 
packing list, you open too many cans with too many worms.

> Point 2: Multiple and/or Non-numeric Ensemble axes

Here, you have a valid point, although - again - I would not connect this to
knowing "that she has all the models". Yet, within the packed file (the
suitcase) you want to know which hierarchy model (packing order) was applied in 
order to be able to aggregate things (for example by computing ensemble 
averages). See also my use-case on aircraft data introduced below. Question: 
what happens to this kind of information when the files are flattened and 
re-packaged? It might well become meaningless, which would indicate that these 
are "temporary" metadata, and thus probably out of scope for CF. This actually 
reminds me a bit of my experiences with the history attribute when I use ncks 
-A. This command will preserve the history of one file, but discard the history 
of the other file, which is certainly not the behavior you would like to see in 
ungrouping/re-grouping software.
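
To make concrete what I would expect from re-grouping software, here is a minimal sketch of a history merge that keeps the entries of both inputs. It works on plain strings only; merge_history and the two sample entries are made up for illustration and are not part of NCO or any other tool.

```python
def merge_history(*histories):
    """Combine history attributes, keeping each distinct entry once, in order."""
    merged = []
    for history in histories:
        for entry in history.splitlines():
            entry = entry.strip()
            # Skip blank lines and entries already seen in an earlier input.
            if entry and entry not in merged:
                merged.append(entry)
    return "\n".join(merged)


# Hypothetical history attributes of two files that are packed together.
hist_a = "2013-09-02: extracted o3 into a.nc"
hist_b = "2013-09-03: extracted co into b.nc"
combined = merge_history(hist_a, hist_b)
```

With this policy neither history is discarded, and an entry that both files share is stored only once.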

> Point 3: Weights and intentional reproducibility of MME statistics

In my view this is actually just another viewing angle on your point #2.

--

Your use-case does however highlight the "convenience" of grouping data which
somehow belong together into one file. In a world of flat files, one must
check coordinates each time one wants to perform some sort of (ensemble)
averaging operation. A hierarchical file will tell you that it is OK to average
by placing the common coordinates on the upper level. IMPORTANT: again, this
doesn't mean that this is the only or best way to do the grouping - yet, it
seems a compelling advantage to have this coordinate-consistency problem
eliminated somewhere along your processing steps. As others said already: there
are reasons why people use suitcases.
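
A small sketch of what I mean by common coordinates on the upper level. This is a toy model in plain Python, loosely modelled on how netCDF-4 makes a dimension defined in a parent group visible in its child groups; the Group class, ensemble_mean and all values are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Group:
    name: str
    parent: Optional["Group"] = None
    coords: dict = field(default_factory=dict)     # coordinate name -> tuple of values
    variables: dict = field(default_factory=dict)  # variable name -> list of values

    def find_coord(self, name):
        """Look up a coordinate in this group, then in its ancestors."""
        group = self
        while group is not None:
            if name in group.coords:
                return group.coords[name]
            group = group.parent
        raise KeyError(name)


def ensemble_mean(members, var, coord):
    """Average var over members, but only if all members resolve coord identically."""
    axes = [m.find_coord(coord) for m in members]
    if any(axis != axes[0] for axis in axes):
        raise ValueError(f"members disagree on coordinate {coord!r}")
    columns = zip(*(m.variables[var] for m in members))
    return [sum(col) / len(members) for col in columns]


# The time axis lives on the root group and is shared by both members.
root = Group("root", coords={"time": (0, 6, 12)})
m1 = Group("member1", parent=root, variables={"t2m": [280.0, 282.0, 284.0]})
m2 = Group("member2", parent=root, variables={"t2m": [282.0, 284.0, 286.0]})
mean = ensemble_mean([m1, m2], "t2m", "time")
```

Because both members inherit "time" from the root, the consistency check is trivially satisfied. A member that defined its own, different time axis would be rejected before any averaging takes place.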

--

Now, here is another use case, which we haven't implemented yet - partly 
because we didn't see how it can be done in a CF consistent way:
While there has been a definition of a standard file layout for data from 
multiple stations (a contribution from Ben Domenico and Stefano Nativi if I am 
not mistaken), this concept cannot be applied to multiple aircraft flight data. 
The station data can be packaged together with the help of a non-geophysical
"station" coordinate, because all stations share the same time axis. With
aircraft flights, the time axes often don't overlap, and forcing all data onto
the superset of time would be a tremendous waste of space. Groups would seem
the natural solution to this problem! Why not flat files? Because you might
wish to retrieve all the aircraft data which were sampled in a given region 
during a specific period (a natural use case for a catalogue query it seems) in 
one entity, and not in N entities, where you cannot even predict N.
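
To put a rough number on the "waste of space", here is a back-of-the-envelope sketch. It assumes the extreme case in which no two flights share a single time stamp, so the superset time axis has as many points as all flights together; the function name and the sample counts are invented.

```python
def padded_fill_fraction(samples_per_flight):
    """Fraction of fill values if every flight is stored on the union time axis.

    Assumes no two flights share a time stamp, so the union has sum(n_i) points.
    """
    n_flights = len(samples_per_flight)
    union_length = sum(samples_per_flight)
    if union_length == 0:
        raise ValueError("need at least one flight with at least one sample")
    padded_cells = n_flights * union_length  # one row per flight, full union axis
    real_cells = union_length                # each time stamp is real for one flight only
    return 1 - real_cells / padded_cells


# Four hypothetical flights with different numbers of samples.
fraction = padded_fill_fraction([3600, 5400, 1800, 7200])
```

Under this assumption the fill fraction is 1 - 1/N for N flights, regardless of how long the individual flights are, so the padded layout gets worse with every flight added.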

I would think the same applies to "granules" of satellite data which share a 
common calibration, for example.

--

As Nan said, we should try to get back to defining what is really at stake for
CF and what exactly should be proposed. Now this is where my failure to re-read
the convention preamble may show ;-). The main question is: is CF about files 
or about interoperability?  Unfortunately, my view on this is not entirely 
clear, because it seems to be a bit of both. The standard_names clearly have a 
bearing in the interoperable world, and this shows through various links to the 
CF standard_names in web catalogues or controlled vocabulary collections (e.g. 
SeaDataNet). The conventions themselves seem to be more file-oriented - even 
though the discussions about the data model always make a strong point to go 
beyond representation in a (single) file. [If someone disagrees and wishes to
see the CF convention play a more important role in interoperability, then I
would ask why it is not cast into an XML schema extending ISO 19115.] If
CF is indeed "file-oriented", then I do think that it makes a lot of sense to 
support "modern" file structures, which include groups and hierarchies, whether 
we like them or not. Therefore, I would advocate that we focus the discussion 
on two major points with a couple of sub-issues:

1. which parts of CF might fail when we have a hierarchical file? (and let's 
stick to the simple inheritance model of netcdf4 for now!)
1a. what would the current CF checker say if it is fed a hierarchical file?
1b. what happens to global attributes when flat files are grouped together?
1c. do we need to re-phrase some aspects of the convention to make them 
"group-aware"? (this does not include defining new rules - that is covered in 
point 2)
1d. anything else?
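
As an illustration of question 1b, here is one conceivable policy, sketched with plain dictionaries instead of real files: attributes that are identical in every input file move to the root, everything else stays on the group created from that file. This is only an assumption about what grouping software could do, not something CF or netCDF prescribes; group_flat_files and the attribute values are invented.

```python
def group_flat_files(files):
    """Pack flat files into one hierarchy.

    files maps a group name to the global attributes of the original flat file.
    """
    attr_sets = list(files.values())
    if not attr_sets:
        return {"root": {}, "groups": {}}
    first = attr_sets[0]
    # An attribute is shared if every other file has it with the same value.
    shared = {
        key: value
        for key, value in first.items()
        if all(key in attrs and attrs[key] == value for attrs in attr_sets[1:])
    }
    # Whatever is not shared remains attached to the group of its file.
    groups = {
        name: {k: v for k, v in attrs.items() if k not in shared}
        for name, attrs in files.items()
    }
    return {"root": shared, "groups": groups}


# Hypothetical global attributes of two flat files.
packed = group_flat_files({
    "member1": {"Conventions": "CF-1.6", "institution": "Example Centre", "history": "run 1"},
    "member2": {"Conventions": "CF-1.6", "institution": "Example Centre", "history": "run 2"},
})
```

Even this simple policy raises the question of whether a former "global" attribute that now sits on a group still counts as global in the sense of the convention.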

2. where do we need to extend the current CF concept?
2a. introduction of a new attribute "level" (equate "global" with "root"? What 
happens when hierarchical files are flattened? [please see the 3 varieties of 
flattening operations mentioned in an earlier post])
2b. specification of "ensemble_..." attributes? "ensemble_axis" may not be
needed if these axes are defined on the group level (?) Something like
"ensemble_history" or "ensemble_structure" to inform the user about the
grouping principle?
2c. what other "relations" need to be expressed within a hierarchical file? The 
guiding principle here should be that additional rules are only needed if they 
avoid ambiguity and misinterpretation of the data. And here we get onto 
interoperability territory again (see my use case about aircraft data above).


Sorry for this long post -- this just somehow seems to be quite relevant!

Best regards,

Martin


--------------------------------------------------------------------------------
PD Dr. Martin G. Schultz
IEK-8, Forschungszentrum Jülich
D-52425 Jülich
Ph: +49 2461 61 2831



