Dear Martin and Alejandro (following off-list discussions)

> The CF definitions say ''"histogram_of_X[_over_Z]" means histogram (i.e.
> number of counts for each range of X) of variations (over Z) of X.'

Yes, that's in the guidelines for construction of standard names, and there are only two of them at present, as you say.

The simplest case is when you have some quantity Q depending on only one dimension, Q(Z). Then the histogram H(Q) is the number of values of Q which fall into each interval of Q, considering variation over Z. In general there could be more than one dimension retained, and more than one removed. If the original field was Q(P,Y,Z,T), we might construct a histogram H(Q,Z,T), for instance, containing the frequencies of values of Q falling into joint intervals of Q, Z and T, for variation over P and Y. Following the guideline above, we would call this a histogram of Q over P and Y, I think.

It is not necessary to indicate in the standard name the dimensions which the histogram depends on (Z and T in my example) because the coordinate variables (of Z and T) make that clear. Martin suggests that by this argument we could also omit Q from the standard name, and just call it a histogram (or frequency distribution) rather than a histogram of Q, where Q is air temperature, precipitation amount, backscattering ratio, etc. I think there are two reasons why we include Q in the standard name:

* I think a histogram of air temperature is not the same geophysical quantity as a histogram of precipitation amount, for instance, so they should be distinguished by standard name.

* Although histograms are pure numbers, and so are probabilities, probability densities are not. Histograms, probability distributions and probability density functions are all related ways of expressing the same information. In the guidelines, we foresee that we might need names for all of them (though so far we have only histograms) and it would make sense to give them consistent names. The probability density function of air temperature has units of K-1, and of precipitation amount kg-1 m2, for instance.
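The relationship between a histogram, a probability distribution and a probability density function can be illustrated numerically. What follows is only a minimal sketch in Python using the standard library. The temperature values, the bin bounds and the helper name `histogram` are made up for illustration and are not part of the CF conventions.

```python
from bisect import bisect_right


def histogram(values, edges):
    """Count the values falling into each bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect_right(edges, v) - 1
        if 0 <= i < len(counts):      # values outside the bin range are ignored
            counts[i] += 1
    return counts


# Hypothetical air temperatures Q(Z) in K, varying along one dimension Z.
tair = [271.2, 272.8, 273.1, 274.9, 275.0, 276.4, 277.7, 278.3]
edges = [270.0, 274.0, 276.0, 280.0]  # bin bounds in K, deliberately unequal widths

counts = histogram(tair, edges)                # histogram H(Q): pure numbers
total = sum(counts)
probability = [c / total for c in counts]      # probabilities: pure numbers
density = [p / (hi - lo)                       # probability density: units K-1
           for p, lo, hi in zip(probability, edges[:-1], edges[1:])]

print(counts)       # [3, 2, 3]
print(probability)  # [0.375, 0.25, 0.375]
print(density)      # [0.09375, 0.125, 0.09375]

# Two-dimensional case: Q(Y, Z) -> one histogram of Q over Y for each retained Z.
q_yz = [[271.2, 275.0],
        [272.8, 276.4],
        [274.9, 277.7]]               # rows index Y, columns index Z
n_z = len(q_yz[0])
h_zq = [histogram([row[z] for row in q_yz], edges) for z in range(n_z)]
print(h_zq)         # [[2, 1, 0], [0, 1, 2]], indexed as [Z][Q bin]
```

With unequal bin widths the first and last bins have the same probability as each other, but the middle bin, which is narrower, has the highest density. Dividing by the bin width in K is what gives the density its units of K-1, whereas the counts and the probabilities stay dimensionless.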
Because probability density functions of different quantities have different canonical units, they must have different standard names, so Q needs to be included in the standard name.

Cell methods describe how the values represent variation within the cells. The transformation from the values of a quantity to a histogram of the quantity makes the original quantity into a dimension. This seems a more radical transformation than computing a mean or a standard deviation, which doesn't change the dimensions of the variable, but just reduces their size (to unity if completely collapsed). A frequency distribution of Q is regarded as a different geophysical quantity from Q itself, so we have not used cell methods to describe the relationship. Of course, this is a bit arbitrary (like everything else in the CF convention!).

I agree with Martin that we could omit the "over" part of the standard name for histograms, probabilities and probability densities. It is useful to retain the collapsed dimensions as size-1 dimensions, so that their original range can be recorded. They could be assigned a cell method of "sum", the default for extensive quantities, because the histogram applies to their entire range. The same applies to the variable which has been histogrammed and is now a dimension; the histogram is a sum for each of its cells.

For example, in the 1D case, suppose the original field is air_temperature as a function of time only.
Then the histogram variable is

    float hair(tair);
      hair:standard_name="histogram_of_air_temperature";
      hair:units="1";
      hair:cell_methods="time: sum tair: sum";
      hair:coordinates="time";
    float time; // scalar coordinate variable with bounds
    float tair(tair);
      tair:units="K";

As a multidimensional example, suppose the original field is

    float tair(time,altitude,latitude,longitude);
      tair:units="K";
      tair:standard_name="air_temperature";
      tair:cell_methods="altitude: mean area: mean time: mean";

from which we might construct

    float pair(tair,time,altitude);
      pair:standard_name="probability_density_function_of_air_temperature";
      pair:units="K-1";
      pair:cell_methods="altitude: mean time: mean area: sum tair: mean";
      pair:coordinates="latitude longitude"; // to record the ranges

Here, I suggest that the cell_method for area is "sum", because the PDF applies to the whole area, which is an extensive quantity. For air temperature it seems to make more sense to interpret a PDF as a mean within cells, since a PDF is an intensive quantity - you can interpolate it, for example - but not a point quantity if it's calculated from a histogram with finite bin-widths.

Best wishes

Jonathan

----- Forwarded message from martin.juc...@stfc.ac.uk -----

> Date: Wed, 12 Oct 2016 18:05:06 +0000
> From: martin.juc...@stfc.ac.uk
> To: cf-metadata@cgd.ucar.edu
> Subject: [CF-metadata] Usage of histogram_of_X_over_Z
>
> Hello,
>
> There are two standard names of the form histogram_of_..... in the CF
> Standard Name list (at version 36):
> histogram_of_backscattering_ratio_over_height_above_reference_ellipsoid and
> histogram_of_equivalent_reflectivity_factor_over_height_above_reference_ellipsoid.
> Both of these were used in CMIP5 and are set to be used in CMIP6, but the usage
> does not appear to match the standard name descriptions.
>
> The possible confusion is over the role of different coordinates. The CF
> definitions say ''"histogram_of_X[_over_Z]" means histogram (i.e.
> number of counts for each range of X) of variations (over Z) of X.' This
> implies to me that you start with a function of Z and possibly other
> coordinates and end up with a function of X and the other coordinates. E.g.
> if the source data is X(lat,lon,Z), then the histogram data will be of the
> form frequency(lat,lon,X).
>
> In the two CMIP5/CMIP6 draft variables (cfadLidarsr532, cfadDbze94) using
> these standard names the "Z" coordinate which is included in the standard
> name ("height_above_reference_ellipsoid") is one of the coordinates of the
> histogram data variable. Both these variables appear to be joint
> distributions (frequency of X and Y values) over sub-grid variability as a
> function of latitude, longitude and time.
>
> I've been reviewing these existing definitions in some detail because there
> are some new distribution variables in the request and I'd like to make sure
> that we have a consistent approach.
>
> If we need to describe a variable which carries a joint distribution of X
> and Y, then the variable will have to use X and Y as coordinates, so perhaps
> we can simplify the process by leaving them out of the standard name.
> Similarly the "over_Z" part of the name would be better expressed as a
> cell_methods construct. This line of reasoning suggests using a new standard
> name such as "frequency_distribution" (units "1"). The only difficulty is
> that the frequency distribution might be a function of the quantities X and Y
> (scattering ratio and cloud top height for cfadLidarsr532) and also of
> latitude, longitude and time. There should be some way of distinguishing the
> different roles of these 5 coordinates: it is the distribution of X and Y as
> a function of latitude, longitude and time. I think this could be done
> conveniently by introducing a single new attribute, e.g. "bin_coords: X Y".
>
> "frequency_distribution" could be used for single or joint distributions.
>
> My questions to the list are:
> (1) am I missing something in my interpretation of the existing
> histogram_of_... names?
> (2) if not, is the adoption of a "frequency_distribution" standard name an
> appropriate way forward?
>
> regards,
> Martin
> _______________________________________________
> CF-metadata mailing list
> CF-metadata@cgd.ucar.edu
> http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata

----- End forwarded message -----