Re: [CF-metadata] Usage of histogram_of_X_over_Z

Jonathan Gregory Thu, 27 Oct 2016 09:09:35 -0700

Dear Martin

> In broad usage, I have the impression that a "histogram" can be expressed as 
> either a count or a percentage, so we should be explicit in the convention if 
> we want a narrower definition here. A narrower definition is probably needed, 
> as there would otherwise be no way of distinguishing between the two.


I agree with that, but the idea is that a standard name of histogram would be
for a count, while probability would be for a fraction. The latter could be
0-1 or 0-100%; they are dimensionally equivalent but have different units. We
could clarify that in the guidelines.
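
As a minimal sketch of that distinction (it is not part of the CF guidelines,
and the sample temperatures, bin edges and function name are made up for
illustration), the same binned data can be expressed as a count, as a
fraction (units "1") or as a percentage (units "%"):

```python
def histogram_counts(values, edges):
    """Count the values falling in each half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

# Hypothetical air temperatures (K) and bin edges (K), for illustration only
temps = [271.0, 274.5, 275.0, 279.9, 281.2, 284.0, 284.9, 289.5]
edges = [270.0, 275.0, 280.0, 285.0, 290.0]

counts = histogram_counts(temps, edges)     # histogram: a count per bin
total = sum(counts)
fractions = [c / total for c in counts]     # probability as a fraction, 0-1
percents = [100.0 * f for f in fractions]   # the same probability, 0-100%

print(counts, fractions, percents)
```

The three lists carry the same information; only the normalisation and the
units differ.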

> There are two further CMIP variables, both of which are bi-variate 
> distributions, with bins of spectral bands and cloud top height ranges, which 
> I'd like to bring into the discussion, but it might be useful to transfer the 
> conclusions of the exchange so far into a ticket first. I think the two 
> additional variables could be covered by a simple extension to 
> "probability_density_function_of_X_and_Y" ... though you might want to insert 
> "joint_" at the beginning of the term.

OK, that's interesting. I agree that it would fit.

Best wishes

Jonathan
> 
> Dear Martin and Alejandro  (following off-list discussions)
> 
> > The CF definitions say '"histogram_of_X[_over_Z]" means histogram (i.e. 
> > number of counts for each range of X) of variations (over Z) of X.'
> 
> Yes, that's in the guidelines for construction of standard names, and there
> are only two of them at present, as you say. The simplest case is when you
> have some quantity Q depending on only one dimension, Q(Z). Then the histogram
> H(Q) is the number of values of Q which fall into each interval of Q,
> considering variation over Z.  In general there could be more than one
> dimension retained, and more than one removed. If the original field was
> Q(P,Y,Z,T), we might construct a histogram H(Q,Z,T), for instance, containing
> the frequencies of values of Q falling into joint intervals of Q, Z and T, for
> variation over P and Y. Following the guideline above, we would call this a
> histogram of Q over P and Y, I think.
> 
> It is not necessary to indicate in the standard name the dimensions which
> the histogram depends on (Z and T in my example) because the coordinate
> variables (of Z and T) make that clear. Martin suggests that by this argument
> we could also omit Q from the standard name, and just call it a histogram
> (or frequency distribution) rather than a histogram of Q, where Q is air
> temperature, precipitation amount, backscattering ratio, etc. I think there
> are two reasons why we include Q in the standard name:
> 
> * I think a histogram of air temperature is not the same geophysical quantity
> as a histogram of precipitation amount, for instance, so they should be
> distinguished by standard name.
> 
> * Although histograms are pure numbers, and so are probabilities, probability
> densities are not. Histograms, probability distributions and probability
> density functions are all related ways of expressing the same information.
> In the guidelines, we foresee that we might need names for all of them (though
> so far we have only histograms) and it would make sense to give them
> consistent names. The probability density function of air temperature has
> units of K-1,
> and of precipitation amount kg-1 m2, for instance. Because they have different
> canonical units, they must have different standard names, so Q needs to be
> included in the standard name.
> 
> Cell methods describe how the values represent variation within the cells.
> The transformation from the values of a quantity to a histogram of the
> quantity makes the original quantity into a dimension. This seems more of
> a radical transformation than computing a mean or a standard deviation, which
> doesn't change the dimensions of the variable, but just reduces their size
> (to unity if completely collapsed). A frequency distribution of Q is
> regarded as a different geophysical quantity from Q itself, so we have not
> used cell methods to describe the relationship. Of course, this is a bit
> arbitrary (like everything else in the CF convention!).
> 
> I agree with Martin that we could omit the "over" part of the standard name
> for histograms, probabilities and probability densities. It is useful to
> retain the
> collapsed dimensions as size-1 dimensions, so that their original range can
> be recorded. They could be assigned cell_method of "sum", the default for
> extensive quantities, because the histogram applies to their entire range.
> The same applies to the variable which has been histogrammed and is now a
> dimension; the histogram is a sum for each of its cells.
> 
> For example, in the 1D case, suppose the original field is air_temperature
> as a function of time only. Then the histogram variable is
>   float hair(tair);
>     hair:standard_name="histogram_of_air_temperature";
>     hair:units="1";
>     hair:cell_methods="time: sum tair: sum";
>     hair:coordinates="time";
>   float time; // scalar coordinate variable with bounds
>   float tair(tair);
>     tair:units="K";
> 
> As a multidimensional example, suppose the original field is
>   float tair(time,altitude,latitude,longitude);
>     tair:units="K";
>     tair:standard_name="air_temperature";
>     tair:cell_methods="altitude: mean area: mean time: mean";
> from which we might construct
>   float pair(tair,time,altitude);
>     pair:standard_name="probability_density_function_of_air_temperature";
>     pair:units="K-1";
>     pair:cell_methods="altitude: mean time: mean area: sum tair: mean";
>     pair:coordinates="latitude longitude"; // to record the ranges
> Here, I suggest that the cell_method for area is "sum", because the PDF
> applies to the whole area, which is an extensive quantity. For air temperature
> it seems to make more sense to interpret a PDF as a mean within cells, since a PDF is
> an intensive quantity - you can interpolate it, for example - but not a point
> quantity if it's calculated from a histogram with finite bin-widths.
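
The relation between a histogram with finite bin-widths and a PDF described
above can be sketched as follows. This is only an illustration under stated
assumptions: the counts, the bin edges and the function name are hypothetical,
and the PDF is taken to be constant within each bin (a mean within cells).

```python
def pdf_from_histogram(counts, edges):
    """Convert bin counts to a probability density (per unit of the binned
    quantity), assumed constant within each bin."""
    total = sum(counts)
    return [c / (total * (edges[i + 1] - edges[i]))
            for i, c in enumerate(counts)]

# Hypothetical histogram of air temperature with unequal bin widths
counts = [1, 3, 4]                      # pure numbers, units "1"
edges = [270.0, 275.0, 280.0, 290.0]    # bin bounds in K

pdf = pdf_from_histogram(counts, edges)  # units K-1

# Integrating the density over all bins recovers a total probability of 1
integral = sum(p * (edges[i + 1] - edges[i]) for i, p in enumerate(pdf))
print(pdf, integral)
```

Dividing by the bin width is what gives the PDF its units of K-1, whereas the
histogram and the probability remain pure numbers.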
> 
> Best wishes
> 
> Jonathan
> 
> _______________________________________________
> CF-metadata mailing list
> [email protected]
> http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata

----- End forwarded message -----