Hello Karl

I agree with your analysis that it is unlikely that a data variable will ever 
vary with ensemble_size, so having ensemble_size as a scalar coordinate is 
slightly odd, in that we'd not expect it to be anything other than scalar.
It would meet my use case, but I can see the interest in other options.

It seems to me that the rest of your thoughts are centred around cell_methods.
The conventions describe Cell Methods as:
7.3. Cell Methods

To describe the characteristic of a field that is represented by cell values, 
we define the cell_methods attribute of the variable.

It earlier describes Cells:
7. Data Representative of Cells

When gridded data does not represent the point values of a field but instead 
represents some characteristic of the field within cells of finite "volume," a 
complete description of the variable should include metadata that describes the 
domain or extent of each cell, and the characteristic of the field that the 
cell values represent.

It is not clear to me that the case of defining an ensemble fits with this 
model as described. What is the 'characteristic of the field' within 'cells of 
finite volume' in this case?

Is there appetite to extend the scope of Cell Methods to define such 
characteristics? What are the risks in doing this? Is this proposal extending 
Cell Methods into realms which are already nearly covered by other CF concepts?

I don't have coherent answers to these queries, but I think they are worth a 
little thought before we delve too far into the details of encoding.

many thanks
mark


________________________________
From: CF-metadata [[email protected]] on behalf of Karl Taylor 
[[email protected]]
Sent: 24 July 2015 01:42
To: [email protected]
Subject: Re: [CF-metadata] original_ensemble_size

Hi all,

This addresses the issue of how to associate an ensemble size with a variable.  
It also suggests an alternate way of proceeding that is more general and will 
allow us to record, for example, which models were included in a multi-model 
mean.

First to consider Jim's suggestion:
I agree with Jim that you might want to indicate which member (or members) of 
an ensemble were represented by the variable so you might want to include a 
coordinate variable of "realization".  You could then also define an 
*attribute* of that coordinate as "ensemble_size" which would record the size, 
but currently that approach is not standardized (but of course is permitted) by 
our conventions.

Now Mark's suggestion:
Mark's alternative approach to make "ensemble_size" a coordinate variable 
(presumably in addition to possibly including "realization") would also relate 
it to the variable of interest, but this would be a bit unconventional since a 
variable would normally be considered to be a *function* of its (independent) 
coordinates.  I don't think T(x,realization,ensemble_size) is a proper 
function, since T depends on x and realization, but should be independent of 
ensemble size in most cases.

Jonathan's suggestion:
I think Jonathan suggested including ensemble_size in a cell_methods attribute.
For example:

dimensions:
    lon=72
    lat=96
    e_size=5

variables:
    float precip(lon,lat)
        precip: cell_methods="realization: point (sample_size: e_size)"

where because "realization" is a standard name, it does not need to be 
explicitly declared with a "coordinates" attribute.  Jonathan originally used 
"dimension" rather than "sample_size", but I prefer "sample_size".   If this 
approach were followed, then CF would need to be modified so that "sample_size" 
(along with "interval") was designated to be one of the options for providing 
"standardized" extra information in the cell_methods attribute.  Note that the 
variable "pointed to" by original_domain would not necessarily be a coordinate 
variable; it need not be monotonic and it could be a character variable (i.e., 
a list).

Alternative "new approach"

An approach that is a slight variant on Jonathan's and would allow even more 
information to be provided concerning the ensemble is illustrated by the 
following example:

dimensions:
    lon=72
    lat=96
    members=5

variables:
    float precip(lon,lat)
        precip: cell_methods="member: point (sample_pool: members)"
    int member
        member: standard_name="realization"
    int members(members)
        members: standard_name="realization"

data:
    member = 3
    members = 1, 3, 5, 6, 10

This would tell you precip was from the realization labeled 3 of a 5-member ensemble 
(with labels 1, 3, 5, 6, and 10).  If this approach were adopted, then CF would 
need to be modified so that "sample_pool" (along with "interval") was 
designated to be one of the options for providing "standardized" extra 
information in the cell_methods attribute.

Under Jonathan's approach and also the "new approach", there wouldn't be a need 
to define the standard_name "ensemble_size" because that would be provided by 
the dimension size (5 in the above).
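
As a rough illustration of how a reader of such a file might use this, here is 
a minimal Python sketch. It assumes the proposed "sample_pool" keyword (which 
is not part of CF at present) and uses plain dicts as a stand-in for a real 
netCDF file. The helper names parse_cell_methods and ensemble_size are 
hypothetical, and the regular expression only handles the simple 
"name: method (key: value)" form used in the examples above.

```python
import re

# Matches "name: method", optionally followed by "(key: value)".
# Only the simple single-entry form from the examples is handled.
_CELL_METHOD = re.compile(
    r"(?P<name>\w+):\s*(?P<method>\w+)"
    r"(?:\s*\((?P<key>\w+):\s*(?P<value>\w+)\))?"
)


def parse_cell_methods(text):
    """Return a list of dicts, one per cell_methods entry."""
    entries = []
    for m in _CELL_METHOD.finditer(text):
        entry = {"name": m.group("name"), "method": m.group("method")}
        if m.group("key"):
            entry[m.group("key")] = m.group("value")
        entries.append(entry)
    return entries


def ensemble_size(attrs, variables):
    """Size of the ensemble = length of the variable named by sample_pool."""
    for entry in parse_cell_methods(attrs["cell_methods"]):
        if "sample_pool" in entry:
            return len(variables[entry["sample_pool"]])
    return None  # no sample_pool recorded


# Stand-in for the contents of the example file (hypothetical layout).
variables = {"member": 3, "members": [1, 3, 5, 6, 10]}
precip_attrs = {"cell_methods": "member: point (sample_pool: members)"}

print(parse_cell_methods(precip_attrs["cell_methods"]))
print(ensemble_size(precip_attrs, variables))
```

The point of the sketch is only that no separate "ensemble_size" quantity has 
to be stored: it falls out of the length of the variable that sample_pool 
names.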

Note that the new approach could also be used to record a multi-model ensemble 
mean (I'm not absolutely sure this example complies with the current 
convention, but I think it would if the option to designate the 
"original_domain" were added to CF):

dimensions:
    lon=72
    lat=96
    models=5
    max_len=10

variables:
    float precip(lon,lat)
        precip: cell_methods="realization: mean (sample_pool: models)"
    char models(models, max_len)

data:
    models = "CanESM2", "CESM1", "CNRM-CM5", "HadGEM2", "MIROC-ESM"
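
For anyone unsure how the character variable would be read back, here is a 
small Python sketch of the round trip. It assumes each name is stored as one 
row of a fixed-width char array padded with NUL bytes up to max_len; plain 
Python bytes stand in for the file, and the helper names to_char_rows and 
from_char_rows are hypothetical.

```python
MAX_LEN = 10  # the max_len dimension from the example above

names = ["CanESM2", "CESM1", "CNRM-CM5", "HadGEM2", "MIROC-ESM"]


def to_char_rows(strings, width):
    """Pad each string with NUL bytes to a fixed width (one row per name)."""
    rows = []
    for s in strings:
        raw = s.encode("ascii")
        if len(raw) > width:
            raise ValueError(f"{s!r} does not fit in {width} characters")
        rows.append(raw.ljust(width, b"\x00"))
    return rows


def from_char_rows(rows):
    """Strip the padding and decode each row back to a string."""
    return [row.rstrip(b"\x00").decode("ascii") for row in rows]


rows = to_char_rows(names, MAX_LEN)
decoded = from_char_rows(rows)

# The number of rows is the number of models in the pool.
print(len(decoded), decoded)
```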

Note also that the flexibility of this new approach could be useful for 
dimensions other than realization when, for example, the sampling interval for 
a spatial mean is from scattered stations.  If one were computing a spatial 
mean from 5 stations, for example, this could be recorded as follows:

dimensions:
    stations=5
    max_len=16

variables:
    float precmean
        precmean: cell_methods="area: mean (sample_pool: stations)"
    char stations(stations,max_len)
        stations: coordinates="lat lon"
    float lat(stations)
        lat: standard_name="latitude"
    float lon(stations)
        lon: standard_name="longitude"

data:
    stations = "Oakland", "San Francisco", "Livermore", "San Jose", "Palo Alto"
    lat = 37.62, 37.77, ...
    lon = -122.27, -122.42, ....

I would find it very nice to be able to specify the models contributing to a 
multi-model mean using the above approach.  Anyone else think so?  It would 
also satisfy Mark's use case of wanting to record the size of the ensemble.

Best regards,
Karl

_______________________________________________
CF-metadata mailing list
[email protected]
http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata
