(My apologies for a rushed answer ... proposal deadlines today.)
The CF group discussed this topic at length in the context of how to
infer membership in a model ensemble: how can CF make it evident that
one model run is a close cousin of another's? The basic strategy that
emerged from those discussions was to embed the necessary semantics for
associating files into the global attributes of the files, rather than
to embed specific linkages into the files. Only one special global
attribute would be defined and rigidly standardized by name. It would,
in turn, list the names of the other global attributes that should be
consulted to determine ensemble membership. A match of values for all
of those attributes would indicate ensemble membership. For example:
:ensemble_membership = "institution, model, run_date";
:institution = "my_institution";
:model = "my_model";
:run_date = "my_run_date";
(or whatever -- I just pulled this from the air for illustration).
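A minimal sketch of the matching rule just described, assuming each
file's global attributes have already been read into a plain dict. The
attribute name "ensemble_membership" comes from the illustration above;
the function names and file names are hypothetical and are not part of
any CF proposal.

```python
from collections import defaultdict

# The single standardized attribute name (taken from the illustration above).
MEMBERSHIP_ATTR = "ensemble_membership"


def membership_key(global_attrs):
    """Build a hashable key from the attributes named by MEMBERSHIP_ATTR.

    Returns None when the file does not declare membership or when one
    of the named attributes is missing.
    """
    spec = global_attrs.get(MEMBERSHIP_ATTR)
    if not spec:
        return None
    # Sort the names so the order in which they are listed does not matter.
    names = sorted(n.strip() for n in spec.split(",") if n.strip())
    if not names or any(n not in global_attrs for n in names):
        return None
    return tuple((n, global_attrs[n]) for n in names)


def group_by_membership(files):
    """Group file names whose named attributes all have equal values.

    `files` maps a file name to a dict of that file's global attributes.
    """
    groups = defaultdict(list)
    for name, attrs in files.items():
        key = membership_key(attrs)
        if key is not None:
            groups[key].append(name)
    return dict(groups)


# Hypothetical scan result: two files from one run, one from another.
common = {
    MEMBERSHIP_ATTR: "institution, model, run_date",
    "institution": "my_institution",
    "model": "my_model",
    "run_date": "my_run_date",
}
files = {
    "member1.nc": dict(common),
    "member2.nc": dict(common),
    "other_run.nc": dict(common, run_date="another_run_date"),
}
groups = group_by_membership(files)
```

Because the grouping depends only on attribute values, it can be rebuilt
at any time from the files themselves, wherever they happen to live.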
I believe this is a powerful and general strategy -- applicable to
ensembles, gridspec, and I suspect the swath problem. On the DOWN side,
it means that file linkages are implicit rather than explicit -- i.e.
they must be inferred from the file. On the UP side the solution is
* simple
* general
* human readable (in fact, friendly)
* machine readable (CF awareness in an application would mean
knowing the name(s) of the standardized global attributes)
* stable (has no dependencies on file locations or the order of file
creations)
* robust (linkages between files can be recreated at any time)
This strategy could fit elegantly with things like the ncML "scan"
directives; smart scanners that are pointed to collections of files
(either local or remote THREDDS/OPeNDAP accessible) can build the
associations as needed. Ensemble membership, time-aggregation
membership, forecast series membership, gridspec membership -- all can
be properly ordered and sequenced (in principle) through intelligent
file scans based upon CF standardized contents.
- Steve
================================
V. Balaji wrote:
The gridspec indeed had a proposal about this. Clearly it was a bit
off-topic, but some mechanism of referring to other files was needed. It
consists of an attribute called a link_spec, which has attributes of a
baseURL, a relative pathname, and a checksum for verifying whether the
external file being referenced is indeed the one you're looking for.
There wasn't a special v...@link syntax, but I don't see why it couldn't
have had one.
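As a rough illustration of how a reader might consume a link of this
kind: the sketch below joins a baseURL with a relative pathname and
compares a checksum. The function names, the example URL, and the
choice of SHA-256 are assumptions made for the sketch; the gridspec
proposal as summarized here does not specify them.

```python
import hashlib
from urllib.parse import urljoin


def resolve_link(base_url, relative_path):
    """Join a base URL and a relative pathname into one URL."""
    # urljoin replaces the last path segment of a base that lacks a
    # trailing slash, so add one to keep the whole base path.
    if not base_url.endswith("/"):
        base_url += "/"
    return urljoin(base_url, relative_path)


def checksum_matches(data, expected_hex, algorithm="sha256"):
    """Check that the bytes of a fetched file match the recorded checksum.

    The hash algorithm is an assumption of this sketch.
    """
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest == expected_hex.strip().lower()


# Hypothetical link record: base URL, relative pathname, checksum.
payload = b"stand-in for the bytes of the referenced file"
link = {
    "baseURL": "http://example.org/data",
    "path": "grids/ocean_grid.nc",
    "checksum": hashlib.sha256(payload).hexdigest(),
}
url = resolve_link(link["baseURL"], link["path"])
ok = checksum_matches(payload, link["checksum"])
```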
CMIP5 is proposing a simplified variant on the link_spec. A file
can have a global attribute "associated_files" whose entries are also
formed out of a baseURL and relative pathnames. The only permitted
associated_files are gridspec, and cell areas and volumes that may
be used in cell_methods.
Other approaches have been proposed in this forum, most notably on Trac
#24 and #27, the common_concept thread and Benno's namespace thread.
SAFE has been explained already in this thread.
I agree with John, it would be good to consider this problem in
isolation, without the baggage of gridspecs or common concepts or
namespaces.
John Caron writes:
This topic deserves its own heading, so here it is.
Perhaps we should gather current practices and ideas. I think
Balaji's gridspec has a proposal about this. Can anyone summarize
what SAFE does?
I'm imagining how this is actually used, e.g.:
float data(y,x);
data:coordinates = "l...@file1 l...@file2";
????
John Graybeal wrote:
I like Bryan's recommendation for a UUID or similar.
Now I'm going to be annoying and suggest the UUID *could* be a URI,
or these days, an IRI (International ..).
And I think the way of 'locating' the file should be neither in
packaging nor in local resolution; it should be in global namespace
resolution. This is the way of the future, and is already more
'permanent' than either packaging or local resolution, IMHO.
There is one form of URI in particular that is already resolvable: a
URL. OK, that's an old song, but I'm gonna stick to it for a while
longer. That form meets all the other requirements: it can be
registered in a resolver, it can be guaranteed unique (to the same
authority level as a UUID, anyway), and it is a unique string that
can be used to validate the link. And it has the obvious benefit of
being resolvable right now, for as long as the domain is held and
properly maintained (good URLs don't die).
Since the last paragraph risks starting another unique identifier
war, I promise not to re-engage unless someone asks me to.
Meanwhile, I like
John
On Nov 19, 2009, at 22:23, Bryan Lawrence wrote:
On Thursday 19 November 2009 19:40:08 Jonathan Gregory wrote:
... In some cases, referencing attributes such as "coordinates" and
"ancillary_variables" would, ideally, point to a variable in a
different dataset.
This is a general problem to which CF doesn't have a solution because
it was conceived as a convention for single netCDF files. However we
need a solution as often several files should be treated as a single
dataset.
If the files don't overlap i.e. their contents are complementary, I
think it should be satisfactory to allow variables in one file to be
pointed to by name from another file, with no other mechanism being
required within the file. I don't like the idea of naming one file
within another file, as that would be very fragile. Instead, I think
the file aggregation should be implied by simply defining the group of
files which are to be treated as one file e.g. by putting them in one
directory.
It's the old ones that are the best ones :-) :-) this issue keeps
on coming back ... :-) :-) and we keep trying to ignore it ...
I think we agree that an actual physical filename including path is
useless. We need both a relative link which relies on the
preservation of a group of files in a particular arrangement ...
AND an internal identifier so more robust linking mechanisms can be
used when (if) the data ends up in a managed environment.
I think it's crucial in this situation to ensure that each file has
a unique identifier within it (created, for example, with uuid),
because all solutions which rely on packaging are fragile (SAFE is
probably better than most), but the bottom line is that users move
files around ... and we need some way of ensuring that we/they can
validate that the links that are in place are the ones that were
originally intended.
So relative links would also include the identifier of the intended
target as well as the relative path in operating system agnostic
terms.
That identifier can be used in two ways: a) to validate the link (my
software can always check that the variable that I just opened
following a link from another one is the one that was expected by
checking the container identifier), and b) to produce an identifier
resolver service for the situation where the packaging has had to
be broken (which might occur for performance reasons or ...)
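A minimal sketch of those two uses, assuming a link carries both an
OS-agnostic relative path and the identifier stored inside the intended
target. The Link class, the lookup tables, and the paths are
hypothetical stand-ins; a real implementation would open the target
file and read its identifier rather than consult a dict.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Link:
    relative_path: str  # POSIX-style path, relative to the linking file
    target_id: str      # identifier expected inside the target file


def follow_link(link, source_dir, ids_by_path, resolver):
    """Return the location of the link target, or None if it is not found.

    ids_by_path maps a path to the identifier found in the file there
    (a stand-in for opening the file). resolver maps an identifier to a
    location and plays the role of an identifier resolver service.
    """
    candidate = str(PurePosixPath(source_dir) / link.relative_path)
    # a) validate: the file at the relative path must carry the
    #    identifier that the link expects.
    if ids_by_path.get(candidate) == link.target_id:
        return candidate
    # b) the packaging has been broken: fall back to the resolver.
    return resolver.get(link.target_id)


link = Link(relative_path="grid/areas.nc", target_id="id-of-areas-file")

# Packaging intact: the relative path leads to the expected file.
intact = {"/archive/run1/grid/areas.nc": "id-of-areas-file"}
found_in_place = follow_link(link, "/archive/run1", intact, {})

# Packaging broken: a different file sits at that path, so the
# identifier check fails and the resolver supplies the new location.
moved = {"/archive/run1/grid/areas.nc": "id-of-some-other-file"}
resolver = {"id-of-areas-file": "/managed/store/areas.nc"}
found_by_resolver = follow_link(link, "/archive/run1", moved, resolver)
```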
CF could recommend something like this ...
Bryan
--
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848;
Web: home.badc.rl.ac.uk/lawrence
_______________________________________________
CF-metadata mailing list
[email protected]
http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata
--------------
I have my new work email address: [email protected]
--------------
John Graybeal <mailto:[email protected]>
phone: 858-534-2162
Development Manager
Ocean Observatories Initiative Cyberinfrastructure Project:
http://ci.oceanobservatories.org
Marine Metadata Interoperability Project: http://marinemetadata.org