Dear all,

Using metadata to describe a file internally is much more robust than encoding filenames in a file, and therefore seems preferable to me. I am concerned that the gridspec proposal suggests using relative filenames in an attribute in the file (associated_files). If I understand correctly, that would mean the link would be broken if you chose to give your local copies of CMIP5 files different names, or to store them in a different directory arrangement from the one they have in the archive. (I don't keep my CMIP3 data in the same arrangement as it is held in PCMDI.)

I assume that CMIP5 CF-netCDF files will have other metadata in them identifying the institution and so on, as global attributes, along the lines that Steve mentions. But to make life easier for software which wants to verify the relationship of the files, a UUID or a URL would be useful, wouldn't it? I would suggest that those responsible for archiving and distributing CMIP5 data (PCMDI and others) could assign a unique identifier for each model-scenario-ensemble_member as the data is made available by the modelling centre. These are data-spaces within which CF metadata will distinguish every datum, but the data-spaces themselves need to be distinguished. A unique identifier would be more robust than a combination of more descriptive metadata, as well as easier to process, I would say.

If each file is tagged with a unique identifier (UUID or URL), there should be no need for a checksum on the file. There should be only one file having the right gridspec details for a given UUID.

References within the file to variables in other files can be just by variable name, without any extended @ syntax. But it is too restrictive to insist that variable names can't be repeated within the group of files sharing a UUID, since often they will be: if the files are organised by time ranges, for instance, there will be many time coordinate variables and data variables, each holding part of the data, and it would be a great nuisance to have to give them all unique variable names. I suggest that the rule should be, when a variable is referred to by name,

* if a variable of that name exists within the same file, it is the one meant.

* if it doesn't exist in the file, there must be only one variable with this name anywhere in the set of files.

However, I also think that using unique identifiers to define groups will sometimes be too restrictive. I would like it to be possible to treat any set of files I choose as a single dataset.
For example, since a file with cell measures like area and volume, or a gridspec file, will often be the same for many or all experiments using a given model, I may wish to have only one copy of it, and use it for all the different datasets, even though they have different unique identifiers. Another example is that I might wish to treat data from different ensemble members of an experiment as part of the same dataset, so I can aggregate it with an ensemble axis and then compute stats by collapsing that axis.

This flexibility can easily be achieved simply by allowing the option of ignoring the unique identifier, but still allowing references from one file to another by variable name alone. In this more flexible case, how the group of files is identified depends on the software being used, of course. This is the principle of the cdms cdscan tool, for example, which will treat any arbitrary group of files as a single dataset.

To summarise, therefore, I suggest that

* a CF attribute should be defined to store a unique identifier for a dataset spread across an arbitrary set of files. Within the dataset, other CF metadata must be sufficient to identify each datum uniquely.

* CMIP5 should assign such an identifier for each run submitted and CMOR should be able to record it in the files generated. I think these unique IDs would actually be quite handy for keeping track of CMIP5 data.

* gridspec could refer to variables by name alone, with no associated_files attribute or checksum being required, as the unique id will serve those purposes.

* CF should permit variables to be referred to by name alone in another file, subject to rules (such as proposed above). It is a decision of the data user whether the associated files should be required to have the same unique ID.

Best wishes

Jonathan
