Hi all,
Again, I may be unaware of all the possible uses of hierarchies, but
here's our experience with CMIP.
It seems to me that if hierarchies are for the purpose of "organizing"
datasets (or organizing a bunch of files), this should fall outside CF's
purview because a single hierarchy is rarely ideal for all purposes.
For CMIP we place files in a hierarchical directory structure based on
the global attributes stored in each file. We also bundle collections of
files into datasets, but only for a practical reason: the ESGF search
engine can't efficiently handle millions of files, though it can handle
tens of thousands of datasets. The collections imply a single-level
hierarchy. Note that outside of ESGF, users would normally choose
not to define "datasets" in the same way that we do in ESGF.
In general I think hierarchies can be useful in organizing data, but
rarely will everyone agree on what hierarchy is most convenient, so I
don't see why such hierarchies need to be included in CF. The global
attributes, on the other hand, are fundamental and can be used in
flexible ways to produce whatever hierarchy might be best for a given
situation. In CMIP some of the global attributes normally used to
construct directory structures are: institution name, model name,
experiment name, sampling frequency (e.g., monthly, daily, 3-hourly),
realm (e.g., atmosphere, ocean, land), "realization" (for ensembles of
runs differing only slightly), variable name. The hierarchy suited to
the CMIP archive places the model name at a fairly high level (because
the data are stored at nodes hosted by individual modeling centers; the
distributed dataset can be accessed through a single ESGF portal). Once
the user downloads the data, however, a more appropriate structure might
be to place the variable name at a high level and then near the bottom
of the hierarchy you would find out which models had output that variable.
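To make the point about flexibility concrete, here is a minimal sketch (not CMIP's or ESGF's actual tooling) of how a single set of global attributes could yield either an archive-style or a user-style directory path. The attribute keys, the sample values, and the build_path helper are all assumptions made up for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical global attributes read from one file. The keys and values
# are illustrative only, not the official CMIP attribute names.
attrs = {
    "institution": "inst-A",
    "model": "model-X",
    "experiment": "exp-1",
    "frequency": "monthly",
    "realm": "atmosphere",
    "realization": "run-1",
    "variable": "var-T",
}

# Two orderings of the same attributes: one with the model near the top
# (archive-style), one with the variable at the top (user-style).
ARCHIVE_ORDER = ["institution", "model", "experiment", "frequency",
                 "realm", "realization", "variable"]
USER_ORDER = ["variable", "frequency", "experiment", "model", "realization"]


def build_path(attributes, order):
    """Join the chosen attribute values, in the given order, into a path."""
    return PurePosixPath(*(attributes[key] for key in order))


archive_path = build_path(attrs, ARCHIVE_ORDER)
user_path = build_path(attrs, USER_ORDER)
print(archive_path)
print(user_path)
```

Neither ordering is stored in the file; only the attributes are, so each site or user can pick whichever ordering suits them.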
I agree hierarchies of directories can be quite useful when trying to
find what you need, but the need for flexibility suggests to me that
those hierarchies should appear outside CF. Hierarchies don't seem to
me to be intrinsically needed to make data files self-describing. [In
CMIP the data gets associated with "groups" simply by defining the
global attributes I listed above.]
best regards,
Karl
On 9/19/13 6:55 AM, Corey Bettenhausen wrote:
On Sep 18, 2013, at 12:32 PM, Steve Hankin wrote:
On 9/18/2013 7:56 AM, Roy Mendelssohn - NOAA Federal wrote:
Hi All:
NASA has used hierarchies for years, and appears committed to them. So, either
it is done in an ad hoc way, or through a standard. That doesn't mean CF is
the place for the standard, just that it would be nice to have one.
Roy,
Let's explore the avenue you have opened here: "that doesn't mean CF is the place
for the standard". The need for hierarchies as tools for programming is
indisputable. But will hierarchical groups advance the interoperability objectives of CF?
Steve,
Speaking for myself, I use groups in data files to organize the various
datasets so that a person looking at the file via the commandline (h5dump,
ncdump) or application (HDFView, Panoply) can find the dataset they're
interested in easily. For instance, in our swath-level (L2) data, we have a
number of datasets that aren't really that relevant to our end users, but could
come in handy when diagnosing a problem with the algorithm or to monitor
algorithm performance. So that these diagnostic datasets don't clutter up the
output, we've put them into a separate group from the main datasets.
So, in this case, do the groups make the files more interoperable? Not really,
if we're talking about a completely software-driven system. But this *does*
make them more user-friendly, and we'd definitely like to maximize our
compatibility with those software-driven processes as well. Why not have the
best of both worlds? Hence, I fully support CF incorporating groups into
the conventions. I think Charlie's proposal is an excellent starting point.
Cheers,
-Corey
At the start of this discussion I had assumed that there would be compelling
examples that supported the introduction of hierarchies to CF. Thus far all
that have been put on display seem to be counter-examples(*):
• For CMIP5 any given hierarchy is an arbitrary, brittle
representation. The CMIP5 collection is better modeled by facets (metadata
tags) than by hierarchies.
• The suitcase analogy serves best to illustrate the problems that
hierarchies can bring -- to locate the black socks in a suitcase usually
involves rummaging through the entire suitcase.
• ==> Which speaks to Rich's valid concern that the
data-discovery-to-data-access transition may be very negatively impacted if
hierarchies are not used carefully.
• NASA hierarchies that are 10 levels deep strike me as by definition an
"insider" view of a data collection. These hierarchies may add clarity for the
specific satellite program communicating with its designated science groups, but they are
likely a barrier to an outsider wanting to utilize the data.
To move forward we need to see some compelling use cases that will help us
to understand the costs and benefits.
- Steve
(*) with the exception of Feature Collections types already contained in CF
=================================================
I would point out that every major modern programming language has structures,
which are essentially hierarchies. Matlab was criticized for years for not
having structures, and finally added them a few years back. R has them, C has
them, Python has them, even modern Fortran has them. So clearly there must be
situations where hierarchies make sense, and are more efficient than having
everything flat. There are also clearly situations where flattening everything
makes sense.
My $0.02.
-Roy
On Sep 18, 2013, at 4:52 AM, "Signell, Richard" wrote:
All,
I'm glad we are discussing this topic, but the fact that large data
providers are already distributing data using groups and hierarchies
is not a compelling reason to endorse this practice through CF. After
all, a lot of data providers are currently distributing scientific
data in any number of forms, and the point of CF (along with OGC
standards) is to help clean up the mess!
I agree that groups make sense for metadata and for certain types of
datasets. For example, the discrete sampling geometry featureTypes
like profile collection would be easier to understand and deal with as
a netcdf4 group of profiles rather than as a netcdf3 ragged array.
But the choice was made for CF 1.6 that backward compatibility was
more important.
I don't think it's cowardly to believe that the more folks use groups
to organize their data in an ad hoc way (the suitcase analogy), the
more it will hinder the remarkable progress that has been made
recently on finding and utilizing distributed CF data via the catalog
services (e.g. the geonetwork, gi-cat, geoportal, CKAN instances) that
many governments are setting up. When we open the data service
endpoints that our query returns, we need to have known data
structures, and that's what the CF featureTypes provide.
To return to the suitcase/clothing analogy again, we are rapidly
gaining the capability via good metadata and catalog services to find
all the black socks owned by Jim and Martin that have been washed in
the last week. But if our catalog query returns fourteen of Jim's
suitcases and twelve of Martin's, then we have more work to do.
Unlike socks, luckily we don't need actual suitcases to organize data,
we can construct collections on the fly using whatever attributes we
desire.
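As a rough illustration of constructing collections on the fly, the sketch below filters and regroups a handful of catalog records purely by their attributes. The record layout, the attribute names, and the select and group_by helpers are hypothetical and do not correspond to any particular catalog service's API.

```python
from collections import defaultdict

# Hypothetical catalog records: each stands for one file, described only by
# its attributes. Keys, values, and file names are made up for illustration.
records = [
    {"model": "model-X", "variable": "var-T", "frequency": "monthly", "file": "f1.nc"},
    {"model": "model-Y", "variable": "var-T", "frequency": "monthly", "file": "f2.nc"},
    {"model": "model-X", "variable": "var-P", "frequency": "daily", "file": "f3.nc"},
    {"model": "model-Y", "variable": "var-P", "frequency": "monthly", "file": "f4.nc"},
]


def select(recs, **facets):
    """Return the records whose attributes match every requested facet."""
    return [r for r in recs
            if all(r.get(key) == value for key, value in facets.items())]


def group_by(recs, key):
    """Build a one-level collection of file names keyed on any attribute."""
    groups = defaultdict(list)
    for r in recs:
        groups[r[key]].append(r["file"])
    return dict(groups)


monthly_t = select(records, variable="var-T", frequency="monthly")
by_model = group_by(records, "model")
by_variable = group_by(records, "variable")
print([r["file"] for r in monthly_t])
print(by_model)
print(by_variable)
```

The same records yield a per-model or a per-variable collection on demand, with no grouping fixed in advance.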
I would hope that our job as the CF community would be to identify
compelling additional specific featureTypes that we should support.
And if these identified featureTypes demand groups for efficiency or
some other reason, well, let's have that discussion.
-Rich
On Wed, Sep 18, 2013 at 12:08 AM, Roy Mendelssohn - NOAA Federal wrote:
Hi All:
I am old and slow, and I must be missing something, because at this point most
of the discussion has been about the desirability of files with groups and
hierarchies. Again, unless I am missing something, there already are data
providers who are distributing data using groups and hierarchies, including at
least one very large data provider, and they obviously feel that there is a
benefit to such structures. I am not arguing whether they are right or wrong,
just that this is the reality.
If we start from that premise, then the real questions for discussion are,
first, whether there should be conventions on how groups and hierarchies are
used in netcdf4 and hdf5 files, so that a user or software provider will know
what to expect, and second, if such conventions are deemed desirable, whether
CF is the proper place for them to be developed.
My sense is that this is what the original proposers are after.
-Roy
**********************
"The contents of this message do not reflect any position of the U.S. Government or
NOAA."
**********************
Roy Mendelssohn
Supervisory Operations Research Analyst
NOAA/NMFS
Environmental Research Division
Southwest Fisheries Science Center
1352 Lighthouse Avenue
Pacific Grove, CA 93950-2097
e-mail:
[email protected]
(Note new e-mail address)
voice: (831)-648-9029
fax: (831)-648-8440
www:
http://www.pfeg.noaa.gov/
"Old age and treachery will overcome youth and skill."
"From those who have been given much, much will be expected"
"the arc of the moral universe is long, but it bends toward justice" -MLK Jr.
_______________________________________________
CF-metadata mailing list
[email protected]
http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata
--
Dr. Richard P. Signell (508) 457-2229
USGS, 384 Woods Hole Rd.
Woods Hole, MA 02543-1598