Re: [CF-metadata] [cf-convention/cf-conventions] How to Report Uncertainty Chapter (#320)

JonathanGregory Wed, 18 Aug 2021 10:46:45 -0700

Dear Ken

I have had time at last to study and think a bit about your detailed proposal. 
Thank you for preparing and presenting it. I appreciate it's frustrating for 
you that this issue is going slowly. Speaking for myself and from David's 
comments too, I believe this is because it is a large and complicated proposal; 
when you're busy (as we all are), it's hard to create a large enough chunk of 
time to address something requiring lengthy thought. Things might go faster if 
we dealt with it a piece at a time.

I formed my opinions before reading David's, and I find (without surprise) that
many of them are the same. Like David, I'm grateful for your link to the
[GUM](https://www.bipm.org/utils/common/documents/jcgm/JCGM_100_2008_E.pdf).
I too agree with your approach of using ancillary variables to contain
measures of uncertainty. The CF standard (section 3.4) doesn't say what
dimensions ancillary variables should have. Since they're intended to provide
metadata about individual values of a data variable, they would normally have
all the same dimensions. However, I don't think it would be problematic to
allow dimensions to be dropped over which the uncertainty doesn't vary. You
could drop all the dimensions to provide a scalar uncertainty, as in your
examples.
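
To make the idea of dropped dimensions concrete, here is a minimal Python sketch using plain lists. The data values, the variable names and the helper `uncertainty_for` are all hypothetical, invented for illustration; nothing here is defined by CF.

```python
# Hypothetical data variable air_temperature(time, lat) in kelvin.
air_temperature = [
    [280.0, 285.0, 290.0],   # time 0
    [281.0, 286.0, 291.0],   # time 1
]

# Ancillary uncertainty that varies with lat only, so time is dropped.
uncertainty_by_lat = [0.5, 0.3, 0.4]
# Ancillary uncertainty with every dimension dropped: a single scalar.
uncertainty_scalar = 0.5

def uncertainty_for(t, j, ancillary):
    """Look up the uncertainty that applies to data element (t, j)."""
    if isinstance(ancillary, list):
        return ancillary[j]   # same value for every time step
    return ancillary          # same value for every element

print(uncertainty_for(1, 2, uncertainty_by_lat))
print(uncertainty_for(0, 0, uncertainty_scalar))
```

In both cases every data value still has a well-defined uncertainty, which is why dropping the dimensions loses no information.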

I don't think that standard names are the right way to describe the
uncertainties, because the standard name should still identify the geophysical
quantity for which it is an uncertainty e.g. `air_temperature`, and because
each standard name requires particular canonical units, whereas the
uncertainties have the same units as the data.

David mentioned that your proposal requires ancillary variables themselves to
have ancillary variables. I didn't notice an instance of that in the examples -
is there one?

The earlier long and detailed
[discussion](http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2013/006106.html)
of 2013, which David referenced, is certainly very relevant to your
proposal, regarding the distinction between `cell_methods` and standard name
modifiers. Two of the four standard name modifiers (`number_of_observations`
and `status_flag`) are now deprecated, in favour of using them as standard
names instead. That is fine because they don't have units. The other two
(`detection_minimum` and `standard_error`) are uncertainty measures, and hence
relate to your proposal particularly. In order not to complicate the standard
and software, it is one of the CF principles that we don't introduce a new way
to do something we can already do, even if the new way is agreed to be better,
but even so I would be happy if your proposal provided an alternative and
better framework for these measures!

Since ancillary variables are like data variables, I think we could allow them
to have `cell_methods`. As in the discussion of 2013, I believe that
`cell_methods` would be a good place to identify the variable as a measure of
uncertainty. This would mean expanding the idea of what cell methods is for. At
the moment its role is to describe how the data represents statistical
variation of the geophysical quantity within the cells. It seems to me that
this can encompass uncertainty as well if we regard that as being variation
over different realisations of the cells.

If the uncertainty comes from repeated measurement of a quantity with the same
spatiotemporal coordinates, you might really add a dimension which runs over
the individual measurements. This is exactly like an ensemble of model runs
e.g. `float air_temperature(time,lat,lon,realization)`, where `realization` is
the sample dimension. Then if you calculated the standard deviation of the
sample in each spatiotemporal cell, it would have `cell_methods="realization:
standard_deviation"`. The collapsed realization dimension, now of size 1, could
be dropped, because `realization` is also a standard name, and hence the
`cell_methods` implies that a standard deviation was computed over the entire
set of realizations, about which no information is retained (Section 7.3.4).
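
As a rough illustration of collapsing the sample dimension, here is a minimal Python sketch using only the standard library. The toy data, the variable names and the choice of the sample (N-1) estimator are assumptions made for the example, not something CF prescribes.

```python
import statistics

# Hypothetical toy data: air_temperature(cell, realization) in kelvin.
# Each row is one spatiotemporal cell; each column is one realization.
air_temperature = [
    [280.0, 281.0, 282.0],
    [290.0, 290.0, 290.0],
]

# Collapse the realization dimension: one standard deviation per cell.
# statistics.stdev is the sample (N-1) estimator; that choice is an assumption.
air_temperature_sd = [statistics.stdev(cell) for cell in air_temperature]

# Attributes the collapsed variable might carry.
attributes = {
    "standard_name": "air_temperature",
    "units": "K",
    "cell_methods": "realization: standard_deviation",
}

print(air_temperature_sd)
```

The result has one value per cell and no realization axis, while the `cell_methods` entry records that the collapse took place.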

Most of your examples of uncertainty are mathematically described as standard
deviations. I think they are actually standard errors in the statistical sense:
"The standard error (SE) of a statistic is the standard deviation of its
sampling distribution or an estimate of that standard deviation" (Wikipedia). I
note that the GUM doesn't use that term, and probably "experimental standard
deviation" is the same concept, isn't it? I think it's confusing to call it a
standard deviation, however, because it is not the SD of the sample; it's
divided by sqrt(N). I would prefer `standard_error` as a new `cell_method`,
also for consistency with the standard name modifier that has the same meaning,
and allowing us to use the existing `standard_error_multiplier` attribute, as
David mentioned, instead of a standardised comment in cell methods, as you
suggest.
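
The distinction between the SD of the sample and the standard error of its mean can be sketched in a few lines of Python. The measurements below are made up, and the variable names are hypothetical.

```python
import math
import statistics

# Hypothetical repeated measurements of one quantity (kelvin).
measurements = [280.0, 281.0, 282.0, 281.0]

n = len(measurements)
mean = statistics.fmean(measurements)
sample_sd = statistics.stdev(measurements)   # spread of the sample itself
standard_error = sample_sd / math.sqrt(n)    # uncertainty of the mean

print(mean, sample_sd, standard_error)
```

With four measurements the standard error is half the sample SD, and it keeps shrinking as N grows, whereas the sample SD does not.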

All the above leads me to suggest a syntax such as `cell_methods="uncertainty:
standard_error"` for an uncertainty that is mathematically treated as an SD,
like most of your examples. In this syntax, `standard_error` would be a new
cell method, and `uncertainty` would be a new special keyword, rather like
`realization` in meaning, as above, but not requiring the idea of a collapsed
dimension.

You would also like to be able to provide intervals when the uncertainty is not symmetrical. That
could be done by adding a size-one dimension for probability or percentile,
with bounds to specify the interval e.g.
`air_temperature(time,lat,lon,probability)`, where `probability` is a size-one
coordinate or scalar coordinate variable. This could be identified with a
syntax such as `cell_methods="probability: expanded_uncertainty"`. I think
that's the term the GUM uses, isn't it? It could also be called e.g.
`uncertainty_bounds`. The GUM deprecates "confidence interval". An interval
which contains all conceivable values is one which spans probability 0.0 to 1.0.
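
A minimal Python sketch of an asymmetric interval defined by probability bounds follows. The skewed sample, the helper `empirical_quantile` and its index-rounding rule are assumptions for illustration (other quantile definitions exist), and how the two end values would actually be stored in a file is left open here.

```python
# Hypothetical skewed sample of one quantity (kelvin), sorted ascending.
sample = sorted([279.5, 280.0, 280.2, 280.4, 280.5, 280.7,
                 281.0, 281.6, 282.5, 284.0, 286.0])

def empirical_quantile(sorted_values, p):
    """Return a simple empirical quantile for probability p in [0, 1]."""
    index = round(p * (len(sorted_values) - 1))
    return sorted_values[index]

# Size-one probability coordinate with bounds that define the interval.
probability = 0.5
probability_bnds = (0.1, 0.9)

best_estimate = empirical_quantile(sample, probability)
lower, upper = (empirical_quantile(sample, p) for p in probability_bnds)

print(lower, best_estimate, upper)
```

The interval extends further above the central value than below it, which a single symmetric standard error could not express. Setting the bounds to `(0.0, 1.0)` would return the extremes of this sample.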

So far this is all about describing the mathematical nature of the uncertainty.
You also want to describe what it represents. You do this with the standard
name, which David and I both think wouldn't work. Could you do this with
standardised comments in the cell methods? For instance, you could add
`(statistical)` and `(subjective)` for the GUM's Type A and B. The GUM says, "a
Type A standard uncertainty is obtained from a probability density function
(C.2.5) derived from an observed frequency distribution, while a Type B
standard uncertainty is obtained from an assumed probability density function
based on the degree of belief that an event will occur, often called subjective
probability. Both approaches employ recognized interpretations of probability."
I think that if the uncertainty is unqualified it should be assumed to be the
"combined" or total uncertainty. That is consistent with the convention in CF
standard names that an unqualified name means everything is included.
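
To show how software might read such a qualified entry, here is a minimal Python sketch. The regular expression, the function `parse_uncertainty` and the returned keys are hypothetical; the pattern handles only a single `name: method (comment)` entry in the syntax floated in this message, not the full CF `cell_methods` grammar.

```python
import re

# Hypothetical pattern for one "name: method (comment)" entry.
ENTRY = re.compile(r"^\s*(\w+):\s*(\w+)(?:\s*\(([^)]*)\))?\s*$")

def parse_uncertainty(cell_methods):
    """Split a single-entry cell_methods string into name, method and kind."""
    match = ENTRY.match(cell_methods)
    if match is None:
        raise ValueError(f"unrecognised cell_methods: {cell_methods!r}")
    name, method, comment = match.groups()
    # An unqualified entry is read as the combined (total) uncertainty.
    kind = comment.strip() if comment else "combined"
    return {"name": name, "method": method, "kind": kind}

print(parse_uncertainty("uncertainty: standard_error (statistical)"))
print(parse_uncertainty("uncertainty: standard_error"))
```

The second call falls back to `combined`, mirroring the convention that an unqualified name means everything is included.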

I think that's enough for now! I wonder what you think.

Best wishes

Jonathan
