Hi all,

This addresses the issue of how to associate an ensemble size with a variable. It also suggests an alternate way of proceeding that is more general and will allow us to record, for example, which models were included in a multi-model mean.

First to consider Jim's suggestion:
I agree with Jim that you might want to indicate which member (or members) of an ensemble the variable represents, so you might want to include a coordinate variable named "realization". You could then define an *attribute* of that coordinate, "ensemble_size", to record the size. That approach is permitted by our conventions, but it is not currently standardized.

Now Mark's suggestion:
Mark's alternative approach, making "ensemble_size" a coordinate variable (presumably in addition to possibly including "realization"), would also relate it to the variable of interest. This would be a bit unconventional, though, since a variable is normally considered to be a *function* of its (independent) coordinates. I don't think T(x,realization,ensemble_size) is a proper function, since T depends on x and realization but should be independent of ensemble size in most cases.

Jonathan's suggestion:
I think Jonathan suggested including ensemble_size in a cell_methods attribute. For example

dimensions:
    lon=72
    lat=96
    e_size=5

variables:
    float precip(lon,lat)
        precip: cell_methods="realization: point (sample_size: e_size)"

where because "realization" is a standard name, it does not need to be explicitly declared with a "coordinates" attribute. Jonathan originally used "dimension" rather than "sample_size", but I prefer "sample_size". If this approach were followed, then CF would need to be modified so that "sample_size" (along with "interval") was designated to be one of the options for providing "standardized" extra information in the cell_methods attribute. Note that the variable "pointed to" by original_domain would not necessarily be a coordinate variable; it need not be monotonic and it could be a character variable (i.e., a list).
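
To make the lookup concrete, here is a minimal Python sketch of how a reader of such a file might recover the ensemble size from the cell_methods string. It assumes the proposed "sample_size" keyword (which is not part of CF today), uses plain dictionaries as a hypothetical stand-in for the file contents rather than any real netCDF library, and handles only the simple "name: method (key: value)" form shown above.

```python
import re

# Hypothetical stand-in for the example file above; not a real netCDF API.
dimensions = {"lon": 72, "lat": 96, "e_size": 5}
cell_methods = "realization: point (sample_size: e_size)"

# Matches "name: method" with an optional "(key: value)" suffix.
ENTRY = re.compile(r"(\w+):\s*(\w+)(?:\s*\(\s*(\w+)\s*:\s*(\w+)\s*\))?")


def parse_cell_methods(text):
    """Return a list of (name, method, extras) tuples."""
    entries = []
    for name, method, key, value in ENTRY.findall(text):
        extras = {key: value} if key else {}
        entries.append((name, method, extras))
    return entries


def ensemble_size(text, dims):
    """Look up the dimension named by the proposed sample_size keyword."""
    for name, _method, extras in parse_cell_methods(text):
        if name == "realization" and "sample_size" in extras:
            return dims[extras["sample_size"]]
    return None  # no ensemble information recorded


print(ensemble_size(cell_methods, dimensions))  # prints 5
```

The point of the sketch is that the size never has to be stored as data: it falls out of the length of the dimension that the keyword names.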

Alternative "new approach":

An approach that is a slight variant on Jonathan's and would allow even more information to be provided concerning the ensemble is illustrated by the following example:

dimensions:
    lon=72
    lat=96
    members=5

variables:
    float precip(lon,lat)
        precip: cell_methods="member: point (sample_pool: members)"
    int member
        member: standard_name="realization"
    int members(members)
        members: standard_name="realization"

data:
    member = 3
    members = 1, 3, 5, 6, 10

This would tell you precip was from the realization labeled 3 of a 5-member ensemble (with labels 1, 3, 5, 6, and 10). If this approach were adopted, then CF would need to be modified so that "sample_pool" (along with "interval") was designated to be one of the options for providing "standardized" extra information in the cell_methods attribute.

Under Jonathan's approach and also the "new approach", there wouldn't be a need to define the standard_name "ensemble_size" because that would be provided by the dimension size (5 in the above).
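
As a rough illustration of what software could extract under the "new approach", the Python sketch below resolves the selected member, the pool of labels, and the ensemble size from the example above. The dictionaries and the describe_sample helper are hypothetical stand-ins (no real netCDF library is used), and "sample_pool" is the proposed keyword, not something CF currently defines.

```python
import re

# Hypothetical stand-in for the "new approach" example; not a real netCDF API.
dimensions = {"lon": 72, "lat": 96, "members": 5}
variables = {
    "precip": {
        "dims": ("lon", "lat"),
        "attrs": {"cell_methods": "member: point (sample_pool: members)"},
    },
    "member": {"dims": (), "attrs": {"standard_name": "realization"}, "data": 3},
    "members": {
        "dims": ("members",),
        "attrs": {"standard_name": "realization"},
        "data": [1, 3, 5, 6, 10],
    },
}

# Matches "name: method (sample_pool: variable)".
ENTRY = re.compile(r"(\w+):\s*(\w+)\s*\(\s*sample_pool\s*:\s*(\w+)\s*\)")


def describe_sample(var_name):
    """Summarize the sample pool referenced by a variable's cell_methods."""
    match = ENTRY.search(variables[var_name]["attrs"]["cell_methods"])
    if match is None:
        return None
    selected_name, method, pool_name = match.groups()
    pool = variables[pool_name]
    return {
        "method": method,
        # The selected label exists only if a variable of that name is present.
        "selected": variables.get(selected_name, {}).get("data"),
        "pool": pool["data"],
        # The ensemble size is the length of the pool variable's dimension.
        "size": dimensions[pool["dims"][0]],
    }


print(describe_sample("precip"))
```

Here the size (5) comes from the "members" dimension, while the labels themselves come from the data of the "members" variable.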

Note that the new approach could also be used to record a multi-model ensemble mean (I'm not absolutely sure this example complies with the current convention, but I think it would if the "sample_pool" option were added to CF):

dimensions:
    lon=72
    lat=96
    models=5
    max_len = 10

variables:
    float precip(lon,lat)
        precip: cell_methods="realization: mean (sample_pool: models)"
    char models(models, max_len)

data:
    models = "CanESM2", "CESM1", "CNRM-CM5", "HadGEM2", "MIROC-ESM"
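
For anyone unfamiliar with how a list of names fits into "char models(models, max_len)", here is a small Python sketch of the fixed-width layout. It assumes each name is padded to max_len with NUL characters (a padding choice I am assuming for illustration), and the helper names are hypothetical rather than taken from any library.

```python
# Model names and row width taken from the example above.
models = ["CanESM2", "CESM1", "CNRM-CM5", "HadGEM2", "MIROC-ESM"]
MAX_LEN = 10


def to_char_rows(names, max_len):
    """Pack each name into a row of exactly max_len characters."""
    rows = []
    for name in names:
        if len(name) > max_len:
            raise ValueError(f"{name!r} is longer than max_len={max_len}")
        # Assumed padding: trailing NUL characters.
        rows.append(list(name.ljust(max_len, "\0")))
    return rows


def from_char_rows(rows):
    """Recover the names by joining each row and dropping the padding."""
    return ["".join(row).rstrip("\0") for row in rows]


rows = to_char_rows(models, MAX_LEN)
print(len(rows), len(rows[0]))          # prints 5 10
print(from_char_rows(rows) == models)   # prints True
```

The number of rows (5) is the size of the multi-model ensemble, so again no separate "ensemble_size" needs to be stored.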

Note also that the flexibility of this new approach could be useful for dimensions other than realization, for example when the sample for a spatial mean comes from scattered stations. If one were computing a spatial mean from 5 stations, this could be recorded as follows:

dimensions:
    stations=5
    max_len=16

variables:
    float precmean
        precmean: cell_methods="area: mean (sample_pool: stations)"
    char stations(stations,max_len)
        stations: coordinates="lat lon"
    float lat(stations)
        lat: standard_name="latitude"
    float lon(stations)
        lon: standard_name="longitude"

data:
    stations = "Oakland", "San Francisco", "Livermore", "San Jose", "Palo Alto"
    lat = 37.62, 37.77, ...
    lon = -122.27, -122.42, ...

I would find it very nice to be able to specify the models contributing to a multi-model mean using the above approach. Anyone else think so? It would also satisfy Mark's use case of wanting to record the size of the ensemble.

Best regards,
Karl

