Dear Ken

I have had time at last to study and think a bit about your detailed proposal. 
Thank you for preparing and presenting it. I appreciate it's frustrating for 
you that this issue is going slowly. Speaking for myself and from David's 
comments too, I believe this is because it is a large and complicated proposal; 
when you're busy (as we all are), it's hard to create a large enough chunk of 
time to address something requiring lengthy thought. Things might go faster if 
we dealt with it a piece at a time.

I formed my opinions before reading David's, and I find (without surprise) that 
many of them are the same. Like David, I'm grateful for your link to the 
[GUM](https://www.bipm.org/utils/common/documents/jcgm/JCGM_100_2008_E.pdf). 
I too agree with your approach of using ancillary variables to contain 
measures of uncertainty. The CF standard (section 3.4) doesn't say what 
dimensions ancillary variables should have. Since they're intended to provide 
metadata about individual values of a data variable, they would normally have 
all the same dimensions. However, I don't think it would be problematic to 
allow dimensions to be dropped over which the uncertainty doesn't vary. You 
could drop all the dimensions to provide a scalar uncertainty, as in your 
examples.

I don't think that standard names are the right way to describe the 
uncertainties, because the standard name should still identify the geophysical 
quantity for which it is an uncertainty e.g. `air_temperature`, and because 
each standard name requires particular canonical units, whereas the 
uncertainties have the same units as the data.

David mentioned that your proposal requires ancillary variables themselves to 
have ancillary variables. I didn't notice an instance of that in the examples - 
is there one?

The earlier long and detailed 
[discussion](http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2013/006106.html) 
of 2013, which David referenced, is certainly very relevant to your 
proposal, regarding the distinction between `cell_methods` and standard name 
modifiers. Two of the four standard name modifiers (`number_of_observations` 
and `status_flag`) are now deprecated, in favour of using them as standard 
names instead. That is fine because they don't have units. The other two 
(`detection_minimum` and `standard_error`) are uncertainty measures, and hence 
relate to your proposal particularly. In order not to complicate the standard 
and software, it is one of the CF principles that we don't introduce a new way 
to do something we can already do, even if the new way is agreed to be better, 
but even so I would be happy if your proposal provided an alternative and 
better framework for these measures!

Since ancillary variables are like data variables, I think we could allow them 
to have `cell_methods`. As in the discussion of 2013, I believe that 
`cell_methods` would be a good place to identify the variable as a measure of 
uncertainty. This would mean expanding the idea of what cell methods is for. At 
the moment its role is to describe how the data represents statistical 
variation of the geophysical quantity within the cells. It seems to me that 
this can encompass uncertainty as well if we regard that as being variation 
over different realisations of the cells.

If the uncertainty comes from repeated measurement of a quantity with the same 
spatiotemporal coordinates, you might really add a dimension which runs over 
the individual measurements. This is exactly like an ensemble of model runs 
e.g. `float air_temperature(time,lat,lon,realization)`, where `realization` is 
the sample dimension. Then if you calculated the standard deviation of the 
sample in each spatiotemporal cell, it would have `cell_methods="realization: 
standard_deviation"`. The collapsed realization dimension, now of size 1, could 
be dropped, because `realization` is also a standard name, and hence the 
`cell_methods` implies that a standard deviation was computed over the entire 
set of realizations, about which no information is retained (Section 7.3.4).
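
As a minimal sketch of the collapse described above, the following pure-Python example computes a standard deviation over a `realization` axis. The values, the two-cell layout, and the helper name `collapse_realization` are all made up for illustration, and the choice of the sample (N-1) standard deviation is an assumption, not something stated in the discussion.

```python
import statistics

# Hypothetical data: three realizations of air_temperature (in K) in each of
# two spatiotemporal cells, laid out as air_temperature[cell][realization].
air_temperature = [
    [280.0, 281.0, 282.0],
    [290.0, 290.5, 291.0],
]


def collapse_realization(data):
    """Return the sample standard deviation over the realization axis of each cell."""
    return [statistics.stdev(cell) for cell in data]


air_temperature_sd = collapse_realization(air_temperature)

# Attributes the collapsed variable might carry; the realization dimension is
# gone, and cell_methods records how it was collapsed.
attributes = {
    "standard_name": "air_temperature",
    "units": "K",
    "cell_methods": "realization: standard_deviation",
}

print(air_temperature_sd)
```

The result has one value per cell and keeps the units of the data, which is why the standard name can stay `air_temperature`.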

Most of your examples of uncertainty are mathematically described as standard 
deviations. I think they are actually standard errors in the statistical sense: 
"The standard error (SE) of a statistic is the standard deviation of its 
sampling distribution or an estimate of that standard deviation" (Wikipedia). I 
note that the GUM doesn't use that term, and probably "experimental standard 
deviation" is the same concept, isn't it? I think it's confusing to call it a 
standard deviation, however, because it is not the SD of the sample; it's 
divided by sqrt(N). I would prefer `standard_error` as a new `cell_method`, 
also for consistency with the standard name modifier that has the same meaning, 
and allowing us to use the existing `standard_error_multiplier` attribute, as 
David mentioned, instead of a standardised comment in cell methods, as you 
suggest.
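
To make the distinction concrete, here is a small sketch contrasting the standard deviation of a sample with the standard error of its mean, which is that standard deviation divided by sqrt(N). The measurement values and the function name are hypothetical.

```python
import math
import statistics


def standard_error_of_mean(sample):
    """Sample standard deviation divided by sqrt(N), for N = len(sample)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Hypothetical repeated measurements of one quantity at the same coordinates.
measurements = [10.0, 12.0, 14.0, 16.0]

sd = statistics.stdev(measurements)          # spread of the individual values
se = standard_error_of_mean(measurements)    # uncertainty of their mean

print(sd, se)
```

With four measurements the standard error is half the sample standard deviation, so labelling both as "standard deviation" would hide a factor of sqrt(N).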

All the above leads me to suggest a syntax such as `cell_methods="uncertainty: 
standard_error"` for an uncertainty that is mathematically treated as an SD, 
like most of your examples. In this syntax, `standard_error` would be a new 
cell method, and `uncertainty` would be a new special keyword, rather like 
`realization` in meaning, as above, but not requiring the idea of a collapsed 
dimension.
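
For illustration only, the sketch below shows how software might read such a string. It handles a deliberately simplified form of `cell_methods` (single `name: method` entries with an optional parenthesised comment), not the full CF grammar, and the function name `parse_cell_methods` is hypothetical.

```python
import re

# One "name: method" entry, optionally followed by a "(comment)".
# This is a simplification: real cell_methods strings allow more forms.
ENTRY = re.compile(r"(\w+):\s*(\w+)(?:\s*\(([^)]*)\))?")


def parse_cell_methods(text):
    """Split a simplified cell_methods string into (name, method, comment) tuples."""
    return [(m.group(1), m.group(2), m.group(3)) for m in ENTRY.finditer(text)]


# The syntax suggested above, where "uncertainty" acts as a keyword rather
# than naming a dimension.
proposed = parse_cell_methods("uncertainty: standard_error")

# An existing-style string with two entries, for comparison.
existing = parse_cell_methods("time: mean realization: standard_deviation")

print(proposed)
print(existing)
```

Under this reading, a parser treats `uncertainty` the same way it treats `realization`; only the interpretation of the name differs.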

You would also like to be able to provide intervals when they are not symmetrical. That 
could be done by adding a size-one dimension for probability or percentile, 
with bounds to specify the interval e.g. 
`air_temperature(time,lat,lon,probability)`, where `probability` is a size-one 
coordinate or scalar coordinate variable. This could be identified with a 
syntax such as `cell_methods="probability: expanded_uncertainty"`. I think 
that's the term the GUM uses, isn't it? It could also be called e.g. 
`uncertainty_bounds`. The GUM deprecates "confidence interval". An interval 
which contains all conceivable values is one which spans probability 0.0 to 1.0.
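
A rough sketch of how such an interval could be derived from a sample follows: the bounds of the `probability` coordinate select two quantiles, which need not be symmetric about the central value. The sample, the linear-interpolation rule, and the function names are assumptions made for this example.

```python
def quantile(sorted_values, p):
    """Linearly interpolated quantile of pre-sorted values, for p in [0, 1]."""
    position = p * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def uncertainty_interval(sample, probability_bounds):
    """Return the (low, high) values spanning the given probability bounds."""
    ordered = sorted(sample)
    low_p, high_p = probability_bounds
    return quantile(ordered, low_p), quantile(ordered, high_p)


# Hypothetical skewed sample; its median is 3.0.
sample = [1.0, 2.0, 3.0, 6.0, 10.0]

# Bounds of the size-one probability coordinate.
interval = uncertainty_interval(sample, (0.25, 0.75))

# Probability 0.0 to 1.0 spans every value in the sample.
full_range = uncertainty_interval(sample, (0.0, 1.0))

print(interval, full_range)
```

Here the interval runs from 1.0 below the median to 3.0 above it, which a single symmetric standard error could not express.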

So far this is all about describing the mathematical nature of the uncertainty. 
You also want to describe what it represents. You do this with the standard 
name, which David and I both think wouldn't work. Could you do this with 
standardised comments in the cell methods? For instance, you could add 
`(statistical)` and `(subjective)` for the GUM's Type A and B. The GUM says, "a 
Type A standard uncertainty is obtained from a probability density function 
(C.2.5) derived from an observed frequency distribution, while a Type B 
standard uncertainty is obtained from an assumed probability density function 
based on the degree of belief that an event will occur, often called subjective 
probability. Both approaches employ recognized interpretations of probability." 
I think that if the uncertainty is unqualified it should be assumed to be the 
"combined" or total uncertainty. That is consistent with the convention in CF 
standard names that an unqualified name means everything is included.

I think that's enough for now! I wonder what you think.

Best wishes

Jonathan


-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/320#issuecomment-901307274