Re: [CF-metadata] Multiple file datasets

Steve Hankin Fri, 20 Nov 2009 09:44:39 -0800

(My apologies for a rushed answer ... proposal deadlines today.)

The CF group discussed this topic at length in the context of how to infer membership in a model ensemble: how can CF make it evident that one model run is a close cousin of another's? The basic strategy that emerged from those discussions was to embed the necessary semantics for associating files into the global attributes of the files, rather than to embed specific linkages into the files. One special global attribute only would be defined and rigidly standardized by name. It would in turn tell the names of other global attributes that should be consulted to determine ensemble membership. A match of values for all of those attributes would indicate ensemble membership. For example:


   :ensemble_membership = "institution, model, run_date";
   :institution = "my_institution";
   :model = "my_model";
   :run_date = "my_run_date";

(or whatever -- I just pulled this from the air for illustration).
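
The matching rule described above can be sketched as follows. This is a minimal sketch, assuming each file's global attributes have already been read into a Python dict (for example by a netCDF library). The function names "membership_key" and "group_by_ensemble" and the file names are hypothetical, and "ensemble_membership" is only the illustrative attribute name used above, not a standardized CF attribute.

```python
from collections import defaultdict

def membership_key(global_attrs, selector="ensemble_membership"):
    """Return the (name, value) pairs that define membership, or None.

    `selector` names the one standardized attribute; its value lists the
    other global attributes whose values must all match.
    """
    names = global_attrs.get(selector)
    if names is None:
        return None  # file does not declare any membership rule
    names = [n.strip() for n in names.split(",") if n.strip()]
    try:
        return tuple((n, global_attrs[n]) for n in names)
    except KeyError:
        return None  # a listed attribute is missing from the file

def group_by_ensemble(files):
    """Group file names whose membership keys are identical."""
    groups = defaultdict(list)
    for fname, attrs in files.items():
        key = membership_key(attrs)
        if key is not None:
            groups[key].append(fname)
    return dict(groups)

# Hypothetical global attributes, as if read from four files.
common = {
    "ensemble_membership": "institution, model, run_date",
    "institution": "my_institution",
    "model": "my_model",
    "run_date": "my_run_date",
}
files = {
    "run_a.nc": dict(common),
    "run_b.nc": dict(common),
    "other.nc": dict(common, model="another_model"),
    "plain.nc": {"title": "no membership attribute"},
}
groups = group_by_ensemble(files)
```

Here "run_a.nc" and "run_b.nc" fall into one group, "other.nc" into a group of its own, and "plain.nc" is left out because it does not carry the selector attribute.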

I believe this is a powerful and general strategy -- applicable to ensembles, gridspec, and I suspect the swath problem. On the DOWN side, it means that file linkages are implicit rather than explicit -- i.e. they must be inferred from the file. On the UP side the solution is:


   * simple
   * general
   * human readable (in fact, friendly)
   * machine readable (CF awareness in an application would mean
     knowing the name(s) of the standardized global attributes)
   * stable (has no dependencies on file locations or the order of file
     creations)
   * robust (linkages between files can be recreated at any time)

This strategy could fit elegantly with things like the ncML "scan" directives; smart scanners that are pointed to collections of files (either local or remote THREDDS/OPeNDAP accessible) can build the associations as needed. Ensemble membership, time-aggregation membership, forecast series membership, gridspec membership -- all can be properly ordered and sequenced (in principle) through intelligent file scans based upon CF standardized contents.


   - Steve

================================

V. Balaji wrote:

The gridspec indeed had a proposal about this. Clearly it was a bit
off-topic, but some mechanism of referring to other files was needed. It
consists of an attribute called a link_spec, which has attributes of a
baseURL, a relative pathname, and a checksum for verifying whether the
external file being referenced is indeed the one you're looking for.
There wasn't a special v...@link syntax, but I don't see why it couldn't
have had one.
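
The three parts of such a link (a base URL, a relative pathname, and a checksum) could be used roughly as sketched here. This is only an illustration under assumptions: the dict keys, the function names, the example URL, and the choice of SHA-256 are all hypothetical, since the actual attribute layout and checksum algorithm of the gridspec proposal are not given in this message.

```python
import hashlib
from urllib.parse import urljoin

def resolve_link(base_url, relative_path):
    """Combine the base URL and relative pathname into one location.

    Note: urljoin replaces the last path segment of base_url unless
    base_url ends with a slash.
    """
    return urljoin(base_url, relative_path)

def checksum_matches(data, expected_hex, algorithm="sha256"):
    """Check that the referenced bytes are the ones the link intended."""
    return hashlib.new(algorithm, data).hexdigest() == expected_hex

# Stand-in for the contents of the external file; no network access here.
payload = b"stand-in for the bytes of the external file"

# Hypothetical link record with the three parts described above.
link = {
    "base_url": "http://example.org/data/",
    "path": "grids/ocean_grid.nc",
    "checksum": hashlib.sha256(payload).hexdigest(),
}

target = resolve_link(link["base_url"], link["path"])
ok = checksum_matches(payload, link["checksum"])
tampered = checksum_matches(payload + b"x", link["checksum"])
```

The checksum does not help to find the file; it only confirms, once something has been found at the resolved location, that it is the file the link was written against.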

CMIP5 is proposing a simplified variant on the link_spec. A file
can have a global attribute "associated_files", whose entries are also
formed out of a baseURL and relative pathnames. The only permitted
associated_files are gridspec, and cell areas and volumes that may
be used in cell_methods.

Other approaches have been proposed in this forum, most notably on Trac
#24 and #27, the common_concept thread and Benno's namespace thread.

SAFE has been explained already in this thread.

I agree with John, it would be good to consider this problem in
isolation, without the baggage of gridspecs or common concepts or
namespaces.

John Caron writes:
This topic deserves its own heading, so here it is.
Perhaps we should gather current practices and ideas. I think Balaji's gridspec has a proposal about this. Can anyone summarize what SAFE does?
I'm imagining how this is actually used, e.g.:

float data(y,x);
data:coordinates = "l...@file1 l...@file2";

????



John Graybeal wrote:
I like Bryan's recommendation for a UUID or similar.
Now I'm going to be annoying and suggest the UUID *could* be a URI, or these days, an IRI (International ..).
And I think the way of 'locating' the file should be neither in packaging nor in local resolution; it should be in global namespace resolution. This is the way of the future, and is already more 'permanent' than either packaging or local resolution, IMHO.
There is one form of URI in particular that is already resolvable: a URL. OK, that's an old song, but I'm gonna stick to it for a while longer. That form meets all the other requirements: it can be registered in a resolver, it can be guaranteed unique (to the same authority level as a UUID, anyway), and it is a unique string that can be used to validate the link. And it has the obvious benefit of being resolvable right now, for as long as the domain is held and properly maintained (Good URLs don't die).
Since the last paragraph risks starting another unique identifier war, I promise not to re-engage unless someone asks me to. Meanwhile, I like
John


On Nov 19, 2009, at 22:23, Bryan Lawrence wrote:
On Thursday 19 November 2009 19:40:08 Jonathan Gregory wrote:
    ... In some cases, referencing attributes such as "coordinates" and "ancillary_variables" would, ideally, point to a variable in a different dataset.
This is a general problem to which CF doesn't have a solution because it was conceived as a convention for single netCDF files. However we need a solution as often several files should be treated as a single dataset.
If the files don't overlap i.e. their contents are complementary, I think it should be satisfactory to allow variables in one file to be pointed to by name from another file, with no other mechanism being required within the file. I don't like the idea of naming one file within another file, as that would be very fragile. Instead, I think the file aggregation should be implied by simply defining the group of files which are to be treated as one file e.g. by putting them in one directory.
It's the old ones that are the best ones :-) :-) this issue keeps on coming back ... :-) :-) and we keep trying to ignore it ...
I think we agree that an actual physical filename including path is useless. We need both a relative link which relies on the preservation of a group of files in a particular arrangement ... AND an internal identifier so more robust linking mechanisms can be used when (if) the data ends up in a managed environment.
I think it's crucial in this situation to ensure that each file has a unique identifier within it (created, for example, with uuid), because all solutions which rely on packaging are fragile (SAFE is probably better than most), but the bottom line is that users move files around ... and we need some way of ensuring that we/they can validate that the links that are in place are the ones that were originally intended.
So relative links would also include the identifier of the intended target as well as the relative path in operating system agnostic terms.
That identifier can be used in two ways: a) to validate the link (my software can always check that the variable that I just opened following a link from another one is the one that was expected by checking the container identifier), and b) to produce an identifier resolver service for the situation where the packaging has had to be broken (which might occur for performance reasons or ...)
CF could recommend something like this ...
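
The two uses of the identifier described above might look roughly like this. It is a minimal sketch, not a proposed convention: the attribute name "file_uuid", the link record layout, the function names, and the paths are all hypothetical, and the global attributes are represented as plain dicts rather than read from real netCDF files.

```python
import uuid
from pathlib import PurePosixPath

def make_link(relative_path, target_id):
    """Build a link holding an OS-agnostic relative path plus the target's id."""
    return {
        "path": PurePosixPath(relative_path).as_posix(),
        "target_id": target_id,
    }

def link_is_valid(link, opened_file_attrs, id_attr="file_uuid"):
    """Use (a): check that the file just opened is the intended target."""
    return opened_file_attrs.get(id_attr) == link["target_id"]

def resolve_by_id(link, registry):
    """Use (b): look the target up by identifier when the packaging is broken."""
    return registry.get(link["target_id"])

# Each file carries its own unique identifier as a global attribute.
grid_id = str(uuid.uuid4())
grid_attrs = {"file_uuid": grid_id}
wrong_attrs = {"file_uuid": str(uuid.uuid4())}

# A data file would store this link to the grid file.
link = make_link("grids/ocean_grid.nc", grid_id)

# Hypothetical resolver table kept by a managed archive.
registry = {grid_id: "archive/2009/ocean_grid.nc"}

valid = link_is_valid(link, grid_attrs)
mismatch = link_is_valid(link, wrong_attrs)
relocated = resolve_by_id(link, registry)
```

The relative path remains the cheap first guess; the identifier is what lets software notice that a file has been moved or replaced, and what a resolver service would key on.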

Bryan

--
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848;
Web: home.badc.rl.ac.uk/lawrence
_______________________________________________
CF-metadata mailing list
[email protected]
http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata
--------------
I have my new work email address: [email protected]
--------------

John Graybeal   <mailto:[email protected]>
phone: 858-534-2162
Development Manager
Ocean Observatories Initiative Cyberinfrastructure Project: http://ci.oceanobservatories.org
Marine Metadata Interoperability Project: http://marinemetadata.org
