Re: [CF-metadata] Missing data bins in histograms

2019-05-14 Thread Martin Juckes - UKRI STFC
Hi Dan,


if we were starting from a blank sheet, that would be a strong point. As it is, 
we are rather constrained by the existing practices in the community. I hope 
that we can find an agreement along the lines of the discussion that Jonathan 
and I are having which makes it possible to support this approach without major 
adjustment.


This is likely (if we succeed) to include presentation of a new example in the 
conventions document. Perhaps we could, at the same time, include an example 
showing the alternative approach which you are suggesting -- but that would 
depend on having a standard name for the number of missing or rejected 
observations approved.


regards,

Martin



From: Hollis, Dan 
Sent: 14 May 2019 16:02
To: Juckes, Martin (STFC,RAL,RALSP); Gregory, Jonathan; 
cf-metadata@cgd.ucar.edu; jbi...@cicsnc.org
Subject: RE: [CF-metadata] Missing data bins in histograms

Hi Martin,

I agree there is no clear line between data and metadata and I didn't really 
intend to suggest there was one. As you say, there are different equally-valid 
views of where the line could/should be drawn in any particular situation 
between the different types of data that we wish to record. My instinct would 
be to separate the result of processing the available data (whether that be a 
mean, a total, a count or a histogram) from information about the data that was 
not available (such as a count of missing observations), but I appreciate that 
is not always necessary or practical.

Regards,

Dan


Re: [CF-metadata] Missing data bins in histograms

2019-05-14 Thread Hollis, Dan
Hi Martin,

I agree there is no clear line between data and metadata and I didn't really 
intend to suggest there was one. As you say, there are different equally-valid 
views of where the line could/should be drawn in any particular situation 
between the different types of data that we wish to record. My instinct would 
be to separate the result of processing the available data (whether that be a 
mean, a total, a count or a histogram) from information about the data that was 
not available (such as a count of missing observations), but I appreciate that 
is not always necessary or practical.

Regards,

Dan


-Original Message-
From: Martin Juckes - UKRI STFC  
Sent: Tuesday, 14 May 2019 15:04
To: Hollis, Dan ; Gregory, Jonathan 
; cf-metadata@cgd.ucar.edu; jbi...@cicsnc.org
Subject: Re: [CF-metadata] Missing data bins in histograms

Hi Dan,


Thanks, that makes it clearer.


The conversation below follows on from one that Karl and I had with people from 
CFMIP (Cloud Feedback Model Intercomparison Project). The variable in question, 
which contains the histogram, is produced to make it possible to compare climate 
model output with a standard product from the MISR imaging spectrometer.


I realise now that I have overlooked a change in the variable definition: 
although the product is computed as a histogram, the results are then 
normalised by total number of observations in each grid cell and reported as a 
percentage, so the actual variable name is 
cloud_area_fraction_in_atmosphere_layer rather than histogram. Their standard 
product has 16 bins: 15 for height ranges and one for the error flag.


When Karl and I started the conversation, one of us did suggest splitting the 
16th bin off into a separate variable, but this was considered as being an 
unwarranted complication: the variable is produced by one software package as a 
single array and used by a range of data analysis packages as a single array. 
Splitting it into two in the NetCDF file and then reassembling the parts 
afterwards would create significant extra work that nobody wants to do.


A considerable volume of data has already been written in the CMIP5 archive 
using this approach, with no CF metadata to inform people of the special nature 
of the 16th bin: the aim here is to improve on that state of affairs by 
providing specific metadata.


I would say that your view of the count of missing values as ancillary data is 
a valid perspective, but the suggestion that you are able to draw a clear line 
between "data" and "metadata" and that this perspective should become standard 
is not tenable. The perspective that counts of error flags are just as much 
data as counts of the other height range bins is also valid.


regards,

Martin



From: Hollis, Dan 
Sent: 14 May 2019 13:47
To: Juckes, Martin (STFC,RAL,RALSP); Gregory, Jonathan; 
cf-metadata@cgd.ucar.edu; jbi...@cicsnc.org
Subject: RE: [CF-metadata] Missing data bins in histograms

Hi Martin,

Sorry, I didn't mean to imply that we would do away with the histogram standard 
names - these would be retained, of course. I just meant that we both want to 
store one extra bit of information (maximum number of obs or, equivalently, 
missing number of obs) and that in both use cases ('histogram_of...' and 
'number_of...') this could be in an ancillary variable, for which we'd need a 
new standard name. Does that make more sense?

I appreciate that your users wish to display the number of missing values 
alongside the counts for the different bins; however, I'd argue that this 
information is ancillary to the histogram itself (in the same way that the 
number of missing days is ancillary to a count of days of air frost) and should 
be stored as such in the netCDF file (rather than in a 'pseudo-bin').

Regards,

Dan


-Original Message-
From: Martin Juckes - UKRI STFC 
Sent: Tuesday, 14 May 2019 13:29
To: Hollis, Dan ; Gregory, Jonathan 
; cf-metadata@cgd.ucar.edu; jbi...@cicsnc.org
Subject: Re: [CF-metadata] Missing data bins in histograms

Hi Dan,


it is a similar concept, but the aim here is to record it in a histogram. We 
have a standard name for the histogram  .. I'm not sure why you think we need 
to change this. Perhaps it would be possible to do away with "histogram_" 
standard names and just use "number_of_observations", but I'm afraid I don't 
see much merit in that approach.


For your use case, I can certainly see that there could be a case for a 
"number_of_missing_observations" standard name, but it doesn't help with the 
specification of the histogram that I want to store.


regards,

Martin


From: Hollis, Dan 
Sent: 14 May 2019 13:13
To: Juckes, Martin (STFC,RAL,RALSP); Gregory, Jonathan; 
cf-metadata@cgd.ucar.edu; jbi...@cicsnc.org
Subject: RE: [CF-metadata] Missing data bins in histograms

Hi Martin,

Thanks for your suggestion - I can see how this could work for our data. 
However I can also see 

Re: [CF-metadata] Missing data bins in histograms

2019-05-14 Thread Martin Juckes - UKRI STFC
Dear Jonathan,


Sorry, I think I misunderstood the scope of valid usage of "flag_values". I've 
only seen it used in contexts in which all values of the flagged array are 
translated using the "flag_values"/"flag_meanings" pairs, but you are 
suggesting, I think, that it should only apply to the one anomalous bin. If we 
can use a single "flag_values" without changing the interpretation of the rest 
of the array, that would make the solution easier.


Does this correspond to what you are thinking of:


float data(time,lat,lon,zbins);
  data: standard_name = "histogram_of_equivalent_reflectivity_factor_over_height_above_reference_ellipsoid";
  data: coordinates="status";
float zbins(zbins);
  zbins: long_name="Height ranges (with bin for missing data at first element)";
  zbins: units="m";
  zbins: bounds="zbin_bnds";
  zbins: standard_name = "height";

  zbins:flag_values =  -.f;
  zbins:flag_meanings = "missing_values";
float zbin_bnds(zbins,2);
char status(char_len);
   status:standard_name = "status_flag";
   status:long_name = "Flag indicating quality of histogram";
float lat(lat);
float lon(lon);

data:
  zbins = -., 25., 100., ;
  zbin_bnds = -.,0., 0., 50., 50., 150., ...

regards,
Martin
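
For readers following the proposal above, here is a minimal sketch of how an application might honour a single flag value on the bin coordinate: it separates the "missing data" bin from the genuine height bins before treating the coordinate as heights. The sentinel -1.0 and the function name split_flagged_bin are assumptions made for illustration only; the actual flag value is not given in the thread, and nothing here is part of the CF conventions.

```python
# Hypothetical sentinel for the "missing data" bin. The real value is not
# given in the thread, so -1.0 is an assumption for illustration only.
FLAG_VALUE = -1.0


def split_flagged_bin(zbins, counts, flag_value=FLAG_VALUE):
    """Separate genuine height bins from the bin marked by flag_value.

    zbins  -- coordinate values, one per bin
    counts -- histogram counts, one per bin
    Returns (heights, height_counts, missing_count).
    """
    heights = []
    height_counts = []
    missing_count = 0
    for z, n in zip(zbins, counts):
        if z == flag_value:
            # Not a physical height: accumulate as missing/rejected data.
            missing_count += n
        else:
            heights.append(z)
            height_counts.append(n)
    return heights, height_counts, missing_count


zbins = [FLAG_VALUE, 25.0, 100.0]
counts = [3, 10, 7]
heights, height_counts, missing_count = split_flagged_bin(zbins, counts)
```

The point of the sketch is that software has to look at flag_values before interpreting every coordinate value as a height, which is the consequence Jonathan describes for users of such data.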

From: CF-metadata  on behalf of Jonathan 
Gregory 
Sent: 14 May 2019 13:43
To: cf-metadata@cgd.ucar.edu
Subject: Re: [CF-metadata] Missing data bins in histograms

Dear Martin

I agree that if valid_range implies masked-out data in some software, we can't
put special values out of the range, and that we shouldn't tamper with missing
data. I still think that flag_values is a better way to indicate special
values in a coordinate variable than an auxiliary coordinate variable would be.
If there are flag values, by definition those values aren't physical coordinate
values, and the user of such data needs to be aware of that. That would be the
consequence of changing the convention to allow flag_values for coordinate
variables, just as it is presently the case that a user of a data variable
ought to check whether it has flag_values, which would likewise indicate that
some of the valid values are not actually physical values. However I don't
think we ought to change the standard_name to signal it, since introducing new
standard_names requires software to recognise both versions.

Best wishes

Jonathan

Re: [CF-metadata] Missing data bins in histograms

2019-05-14 Thread Martin Juckes - UKRI STFC
Hi Dan,


Thanks, that makes it clearer.


The conversation below follows on from one that Karl and I had with people from 
CFMIP (Cloud Feedback Model Intercomparison Project). The variable in question, 
which contains the histogram, is produced to make it possible to compare climate 
model output with a standard product from the MISR imaging spectrometer.


I realise now that I have overlooked a change in the variable definition: 
although the product is computed as a histogram, the results are then 
normalised by total number of observations in each grid cell and reported as a 
percentage, so the actual variable name is 
cloud_area_fraction_in_atmosphere_layer rather than histogram. Their standard 
product has 16 bins: 15 for height ranges and one for the error flag.
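
As a rough illustration of the normalisation described above, the sketch below converts raw counts for 16 bins (15 height ranges plus one error-flag bin) into percentages of the total. It assumes that the total includes the error-flag bin and that the error bin is stored last; neither detail is confirmed in the thread, and the function name counts_to_percent is hypothetical.

```python
def counts_to_percent(counts):
    """Convert per-bin counts to percentages of the total count.

    Assumption: the total includes every bin, including the error-flag bin.
    An all-zero input is returned as all zeros to avoid dividing by zero.
    """
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [100.0 * n / total for n in counts]


# 15 height-range bins with 5 observations each, plus one error-flag bin
# (placed last here purely by assumption) holding 25 failed retrievals.
counts = [5] * 15 + [25]
percent = counts_to_percent(counts)
```

With these numbers the error-flag bin accounts for a quarter of the observations, which shows why dropping it would change the meaning of the remaining percentages.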


When Karl and I started the conversation, one of us did suggest splitting the 
16th bin off into a separate variable, but this was considered as being an 
unwarranted complication: the variable is produced by one software package as a 
single array and used by a range of data analysis packages as a single array. 
Splitting it into two in the NetCDF file and then reassembling the parts 
afterwards would create significant extra work that nobody wants to do.


A considerable volume of data has already been written in the CMIP5 archive 
using this approach, with no CF metadata to inform people of the special nature 
of the 16th bin: the aim here is to improve on that state of affairs by 
providing specific metadata.


I would say that your view of the count of missing values as ancillary data is 
a valid perspective, but the suggestion that you are able to draw a clear line 
between "data" and "metadata" and that this perspective should become standard 
is not tenable. The perspective that counts of error flags are just as much 
data as counts of the other height range bins is also valid.


regards,

Martin



Re: [CF-metadata] Missing data bins in histograms

2019-05-14 Thread Hollis, Dan
Hi Martin,

Sorry, I didn't mean to imply that we would do away with the histogram standard 
names - these would be retained, of course. I just meant that we both want to 
store one extra bit of information (maximum number of obs or, equivalently, 
missing number of obs) and that in both use cases ('histogram_of...' and 
'number_of...') this could be in an ancillary variable, for which we'd need a 
new standard name. Does that make more sense?

I appreciate that your users wish to display the number of missing values 
alongside the counts for the different bins; however, I'd argue that this 
information is ancillary to the histogram itself (in the same way that the 
number of missing days is ancillary to a count of days of air frost) and should 
be stored as such in the netCDF file (rather than in a 'pseudo-bin').

Regards,

Dan


Re: [CF-metadata] Missing data bins in histograms

2019-05-14 Thread Jonathan Gregory
Dear Martin

I agree that if valid_range implies masked-out data in some software, we can't
put special values out of the range, and that we shouldn't tamper with missing
data. I still think that flag_values is a better way to indicate special
values in a coordinate variable than an auxiliary coordinate variable would be.
If there are flag values, by definition those values aren't physical coordinate
values, and the user of such data needs to be aware of that. That would be the
consequence of changing the convention to allow flag_values for coordinate
variables, just as it is presently the case that a user of a data variable
ought to check whether it has flag_values, which would likewise indicate that
some of the valid values are not actually physical values. However I don't
think we ought to change the standard_name to signal it, since introducing new
standard_names requires software to recognise both versions.

Best wishes

Jonathan

- Forwarded message from Martin Juckes - UKRI STFC 
 -

> Date: Tue, 14 May 2019 09:03:19 +
> From: Martin Juckes - UKRI STFC 
> To: Jonathan Gregory , "cf-metadata@cgd.ucar.edu"
>   
> Subject: Re: [CF-metadata] Missing data bins in histograms
> 
> Dear Jonathan,
> 
> 
> I looked at "valid_range", and also "actual_range", but I believe that the 
> definitions of either of these would have to be changed to accommodate this 
> usage, and we would run into the problem that Jim raised in connection with 
> my earlier suggestion of using "missing_value": such changes can break 
> assumptions made by existing software. Data outside the "valid_range" may 
> well be automatically rejected by a user application before the data gets to 
> any CF aware libraries. For instance, python netCDF4 at version 1.3.0 and 
> 1.3.1 automatically removes data outside the valid_range, giving the user a 
> masked array.  There is some discussion of this here: 
> https://github.com/Unidata/netcdf4-python/issues/748.
> 
> 
> It is possible to 
> circumvent this behaviour by changing the auto-masking setting in python 
> netCDF4, and the NUG does suggest using values outside the "valid_range" as 
> flags. NUG also suggests using the missing_value attribute to list such flag 
> values ... but Jim has pointed out that such an approach is likely to cause 
> problems with many applications. This is a complex area because the meaning 
> of "missing_value" in NUG has evolved. Up until CF 1.5 it appears that a 
> "missing_value" meant, unambiguously, missing data.  The current CF appears 
> to have changed this in line with NUG so that different usages are now 
> permissible, but I still agree with Jim's objection. We can't, I'm sure, at 
> this stage, follow an approach which depends on users being able to control 
> the auto-masking settings (it is a simple call to the "set_auto_mask" method 
> if you are using the python netCDF4 library directly ... but may not be 
> available to users who are working with applications built on the library).
> 
> 
> I wanted to use a new standard name for the height bins because of the fact 
> that the value in the first bin, which I have set to -., is not a height. 
> This data point needs to have a valid floating point value to conform to the 
> rules for a coordinate array, but, unlike the rest of the array, it should 
> not be interpreted as height. This is signalled by the presence of an 
> auxiliary coordinate -- but I'm not sure that that is adequate. Applications 
> and users are entitled to believe that a variable which has standard name 
> "height" really refers to height, without having to check all the auxiliary 
> coordinates to see if there is something there which modifies the meaning of 
> the variable. The standard name "height_bins" would signal that they must 
> look in the auxiliary coordinate.
> 
> 
> Do you agree with the necessity and appropriateness of the new name of 
> "bin_status_flag" which I have suggested for the auxiliary coordinate?
> 
> 
> regards,
> 
> Martin
> 
> 
> From: CF-metadata  on behalf of Jonathan 
> Gregory 
> Sent: 13 May 2019 18:00
> To: cf-metadata@cgd.ucar.edu
> Subject: Re: [CF-metadata] Missing data bins in histograms
> 
> Dear Martin
> 
> I agree that an alternative which would not require a change to the
> convention is to attach a string-valued aux coord variable. However, the
> flags are much more economical and seem natural, as you say.
> 
> As I said in my last email, I feel that it's better to keep the standard name
> as it is, despite the presence of a special value in it which isn't really a
> coordinate value. Maybe a valid_range could be specified, with the special
> value outside the range? I'm not sure if that would count as an error, but it
> is not the same as reinterpreting missing data, which would be problematic.
> 
> Best wishes
> 
> Jonathan
> 
> - Forwarded message from Martin Juckes - UKRI STFC 
>  

Re: [CF-metadata] Missing data bins in histograms

2019-05-14 Thread Martin Juckes - UKRI STFC
Hi Dan,


it is a similar concept, but the aim here is to record it in a histogram. We 
have a standard name for the histogram  .. I'm not sure why you think we need 
to change this. Perhaps it would be possible to do away with "histogram_" 
standard names and just use "number_of_observations", but I'm afraid I don't 
see much merit in that approach.


For your use case, I can certainly see that there could be a case for a 
"number_of_missing_observations" standard name, but it doesn't help with the 
specification of the histogram that I want to store.


regards,

Martin


From: Hollis, Dan 
Sent: 14 May 2019 13:13
To: Juckes, Martin (STFC,RAL,RALSP); Gregory, Jonathan; 
cf-metadata@cgd.ucar.edu; jbi...@cicsnc.org
Subject: RE: [CF-metadata] Missing data bins in histograms

Hi Martin,

Thanks for your suggestion - I can see how this could work for our data. 
However I can also see that having to parse the 'interval' text from the 
'cell_methods' comment field and combine that with the bounds from the time 
coordinate is not especially user-friendly! It would be much easier if we could 
store 'maximum_number_of_observations' (or 'number_of_missing_observations') as 
well.

I guess the reason your suggestion does not work for your histograms is that 
there is no obvious place to record the sampling intervals (angular and 
distance) of the radar data. However, if I'm understanding this correctly, all 
the user really needs is the total number of data bins in one sweep of the 
radar. I'd argue that this is similar in concept to 
'maximum_number_of_observations' i.e. maybe we just need a new standard name 
that we can both use. What do you think?

Apologies if I haven't fully grasped the complexities of your data.

Regards,

Dan

-Original Message-
From: Martin Juckes - UKRI STFC 
Sent: Tuesday, 14 May 2019 12:02
To: Hollis, Dan ; Gregory, Jonathan 
; cf-metadata@cgd.ucar.edu; jbi...@cicsnc.org
Subject: Re: [CF-metadata] Missing data bins in histograms

Hello Dan,


I think there is a method for recording the number of valid observations in 
each data point, which, if I've understood correctly, would meet the 
requirement you are describing: using an "ancillary_variable" with standard 
name "number_of_observations".  I don't think there is a method for explicitly 
recording missing values, but you can use "interval" (in the "cell_methods" 
comment) to specify the interval of input data which, together with the 
duration of the calculation, will tell you the maximum number of input values 
available.


In your use-case the number of missing values would be part of the ancillary 
information, in my use case it is the data itself -- the users want a histogram 
which includes a count of failed retrievals.


regards,

Martin


From: Hollis, Dan 
Sent: 14 May 2019 11:22
To: Juckes, Martin (STFC,RAL,RALSP); Gregory, Jonathan; 
cf-metadata@cgd.ucar.edu; jbi...@cicsnc.org
Subject: RE: [CF-metadata] Missing data bins in histograms

Dear Martin/Jonathan/Jim,

I appreciate that this discussion is focussed on histograms, however I wonder 
if there is a wider issue here i.e. how should one record the number of missing 
values for any extensive quantity?

For example, we use number_of_days_with_air_temperature_below_threshold to 
store counts of days of air frost (computed from station observations of daily 
minimum temperature). The threshold is specified using a scalar coordinate 
variable called 'air_temperature' with a value of 0.0. The counts of air frost 
are for periods of months, seasons or years and, inevitably, the values for 
some periods for some stations are based on incomplete data. Is there a 
recommended method for recording the number of missing observations for each 
data point (apologies if I've missed this in the conventions)? If so then maybe 
the same approach could be used for histograms too. If not then my feeling is 
that whatever solution you propose should be applicable to all extensive 
quantities (i.e. all quantities that can be derived from a set of constituent 
observations). Having a special 'bin' might work for histogram data but would 
not work for other variables so I think a different approach is required.

My feeling is that the number of missing values is sort of like metadata i.e. 
it's telling you something about the quality of the data itself. Would an 
ancillary variable suit this purpose?

Regards,

Dan


-Original Message-
From: CF-metadata  On Behalf Of Martin Juckes 
- UKRI STFC
Sent: Tuesday, 14 May 2019 10:03
To: Gregory, Jonathan ; cf-metadata@cgd.ucar.edu
Subject: Re: [CF-metadata] Missing data bins in histograms

Dear Jonathan,


I looked at "valid_range", and also "actual_range", but I believe that the 
definitions of either of these would have to be changed to accommodate this 
usage, and we would run into the problem that Jim raised in connection with my 
earlier suggestion of using 

Re: [CF-metadata] Missing data bins in histograms

2019-05-14 Thread Hollis, Dan
Hi Martin,

Thanks for your suggestion - I can see how this could work for our data. 
However I can also see that having to parse the 'interval' text from the 
'cell_methods' comment field and combine that with the bounds from the time 
coordinate is not especially user-friendly! It would be much easier if we could 
store 'maximum_number_of_observations' (or 'number_of_missing_observations') as 
well.

I guess the reason your suggestion does not work for your histograms is that 
there is no obvious place to record the sampling intervals (angular and 
distance) of the radar data. However, if I'm understanding this correctly, all 
the user really needs is the total number of data bins in one sweep of the 
radar. I'd argue that this is similar in concept to 
'maximum_number_of_observations' i.e. maybe we just need a new standard name 
that we can both use. What do you think?

Apologies if I haven't fully grasped the complexities of your data.

Regards,

Dan

-Original Message-
From: Martin Juckes - UKRI STFC  
Sent: Tuesday, 14 May 2019 12:02
To: Hollis, Dan ; Gregory, Jonathan 
; cf-metadata@cgd.ucar.edu; jbi...@cicsnc.org
Subject: Re: [CF-metadata] Missing data bins in histograms

Hello Dan,


I think there is a method for recording the number of valid observations in 
each data point, which, if I've understood correctly, would meet the 
requirement you are describing: using an "ancillary_variable" with standard 
name "number_of_observations".  I don't think there is a method for explicitly 
recording missing values, but you can use "interval" (in the "cell_methods" 
comment) to specify the interval of input data which, together with the 
duration of the calculation, will tell you the maximum number of input values 
available.


In your use-case the number of missing values would be part of the ancillary 
information, in my use case it is the data itself -- the users want a histogram 
which includes a count of failed retrievals.


regards,

Martin


From: Hollis, Dan 
Sent: 14 May 2019 11:22
To: Juckes, Martin (STFC,RAL,RALSP); Gregory, Jonathan; 
cf-metadata@cgd.ucar.edu; jbi...@cicsnc.org
Subject: RE: [CF-metadata] Missing data bins in histograms

Dear Martin/Jonathan/Jim,

I appreciate that this discussion is focussed on histograms, however I wonder 
if there is a wider issue here i.e. how should one record the number of missing 
values for any extensive quantity?

For example, we use number_of_days_with_air_temperature_below_threshold to 
store counts of days of air frost (computed from station observations of daily 
minimum temperature). The threshold is specified using a scalar coordinate 
variable called 'air_temperature' with a value of 0.0. The counts of air frost 
are for periods of months, seasons or years and, inevitably, the values for 
some periods for some stations are based on incomplete data. Is there a 
recommended method for recording the number of missing observations for each 
data point (apologies if I've missed this in the conventions)? If so then maybe 
the same approach could be used for histograms too. If not then my feeling is 
that whatever solution you propose should be applicable to all extensive 
quantities (i.e. all quantities that can be derived from a set of constituent 
observations). Having a special 'bin' might work for histogram data but would 
not work for other variables so I think a different approach is required.

My feeling is that the number of missing values is sort of like metadata i.e. 
it's telling you something about the quality of the data itself. Would an 
ancillary variable suit this purpose?

Regards,

Dan


-Original Message-
From: CF-metadata  On Behalf Of Martin Juckes 
- UKRI STFC
Sent: Tuesday, 14 May 2019 10:03
To: Gregory, Jonathan ; cf-metadata@cgd.ucar.edu
Subject: Re: [CF-metadata] Missing data bins in histograms

Dear Jonathan,


I looked at "valid_range", and also "actual_range", but I believe that the 
definitions of either of these would have to be changed to accommodate this 
usage, and we would run into the problem that Jim raised in connection with my 
earlier suggestion of using "missing_value": such changes can break assumptions 
made by existing software. Data outside the "valid_range" may well be 
automatically rejected by a user application before the data gets to any CF 
aware libraries. For instance, python netCDF4 at version 1.3.0 and 1.3.1 
automatically removes data outside the valid_range, giving the user a masked 
array.  There is some discussion of this here: 
https://github.com/Unidata/netcdf4-python/issues/748.


It is possible to 
circumvent this behaviour by changing the auto-masking setting in python 
netCDF4, and the NUG does suggest using values outside the "valid_range" as 
flags. NUG also suggests using the missing_value attribute to list such flag 
values ... but Jim has pointed out that such an approach is likely 

Re: [CF-metadata] Missing data bins in histograms

2019-05-14 Thread Martin Juckes - UKRI STFC
Hello Dan,


I think there is a method for recording the number of valid observations in 
each data point, which, if I've understood correctly, would meet the 
requirement you are describing: using an "ancillary_variable" with standard 
name "number_of_observations".  I don't think there is a method for explicitly 
recording missing values, but you can use "interval" (in the "cell_methods" 
comment) to specify the interval of input data which, together with the 
duration of the calculation, will tell you the maximum number of input values 
available.
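
As a rough illustration of the last point, the maximum number of input values 
can be derived from the "interval" entry and the time bounds of a cell. This is 
only a sketch under stated assumptions: the cell_methods string, the unit table 
and the function names parse_interval and max_observations are hypothetical, 
and a real reader would use a proper units library.

```python
import re
from datetime import datetime, timedelta

# Seconds per unit for the few units this sketch understands (an assumption).
_UNIT_SECONDS = {
    "s": 1, "second": 1,
    "min": 60, "minute": 60,
    "h": 3600, "hr": 3600, "hour": 3600,
    "d": 86400, "day": 86400,
}


def parse_interval(cell_methods):
    """Return the 'interval: <value> <unit>' entry of cell_methods as a timedelta."""
    match = re.search(r"interval:\s*([0-9.]+)\s*([A-Za-z]+)", cell_methods)
    if match is None:
        raise ValueError("no 'interval:' entry in cell_methods")
    value = float(match.group(1))
    unit = match.group(2).lower()
    if unit not in _UNIT_SECONDS:
        unit = unit.rstrip("s")  # accept simple plurals such as "days"
    if unit not in _UNIT_SECONDS:
        raise ValueError("unsupported interval unit")
    return timedelta(seconds=value * _UNIT_SECONDS[unit])


def max_observations(cell_methods, bounds_start, bounds_end):
    """Largest number of input values that fit between the two time bounds."""
    return (bounds_end - bounds_start) // parse_interval(cell_methods)


# Hypothetical monthly air-frost count built from daily minimum temperatures.
cell_methods = "time: minimum within days time: sum over days (interval: 1 day)"
maximum = max_observations(cell_methods, datetime(2019, 1, 1), datetime(2019, 2, 1))

# With number_of_observations = 28 in the ancillary variable, the number of
# missing observations is the difference.
missing = maximum - 28
```

Here maximum is 31 and missing is 3, so the count of missing values can be 
recovered without storing it explicitly, although the user has to do this 
parsing themselves.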


In your use-case the number of missing values would be part of the ancillary 
information, in my use case it is the data itself -- the users want a histogram 
which includes a count of failed retrievals.


regards,

Martin


From: Hollis, Dan 
Sent: 14 May 2019 11:22
To: Juckes, Martin (STFC,RAL,RALSP); Gregory, Jonathan; 
cf-metadata@cgd.ucar.edu; jbi...@cicsnc.org
Subject: RE: [CF-metadata] Missing data bins in histograms

Dear Martin/Jonathan/Jim,

I appreciate that this discussion is focussed on histograms, however I wonder 
if there is a wider issue here i.e. how should one record the number of missing 
values for any extensive quantity?

For example, we use number_of_days_with_air_temperature_below_threshold to 
store counts of days of air frost (computed from station observations of daily 
minimum temperature). The threshold is specified using a scalar coordinate 
variable called 'air_temperature' with a value of 0.0. The counts of air frost 
are for periods of months, seasons or years and, inevitably, the values for 
some periods for some stations are based on incomplete data. Is there a 
recommended method for recording the number of missing observations for each 
data point (apologies if I've missed this in the conventions)? If so then maybe 
the same approach could be used for histograms too. If not then my feeling is 
that whatever solution you propose should be applicable to all extensive 
quantities (i.e. all quantities that can be derived from a set of constituent 
observations). Having a special 'bin' might work for histogram data but would 
not work for other variables so I think a different approach is required.

My feeling is that the number of missing values is sort of like metadata i.e. 
it's telling you something about the quality of the data itself. Would an 
ancillary variable suit this purpose?

Regards,

Dan


-Original Message-
From: CF-metadata  On Behalf Of Martin Juckes 
- UKRI STFC
Sent: Tuesday, 14 May 2019 10:03
To: Gregory, Jonathan ; cf-metadata@cgd.ucar.edu
Subject: Re: [CF-metadata] Missing data bins in histograms

Dear Jonathan,


I looked at "valid_range", and also "actual_range", but I believe that the 
definitions of either of these would have to be changed to accommodate this 
usage, and we would run into the problem that Jim raised in connection with my 
earlier suggestion of using "missing_value": such changes can break assumptions 
made by existing software. Data outside the "valid_range" may well be 
automatically rejected by a user application before the data gets to any CF 
aware libraries. For instance, python netCDF4 at version 1.3.0 and 1.3.1 
automatically removes data outside the valid_range, giving the user a masked 
array.  There is some discussion of this here: 
https://github.com/Unidata/netcdf4-python/issues/748.


It is possible to 
circumvent this behaviour by changing the auto-masking setting in python 
netCDF4, and the NUG does suggest using values outside the "valid_range" as 
flags. NUG also suggests using the missing_value attribute to list such flag 
values ... but Jim has pointed out that such an approach is likely to cause 
problems with many applications. This is a complex area because the meaning of 
"missing_value" in NUG has evolved. Up until CF 1.5 it appears that a 
"missing_value" meant, unambiguously, missing data.  The current CF appears to 
have changed this in line with NUG so that different usages are now permissible, but 
I still agree with Jim's objection. We can't, I'm sure, at this stage, follow 
an approach which depends on users being able to control the auto-masking 
settings (it is a simple call to the "set_auto_mask" method if you are using 
the python netCDF4 library directly ... but may not be available to users who 
are working with applications built on the library).


I wanted to use a new standard name for the height bins because of the fact that 
the value in the first bin, which I have set to -., is not a height. This 
data point needs to have a valid floating point value to conform to the rules 
for a coordinate array, but, unlike the rest of the array, it should not be 
interpreted as height. This is signalled by the presence of an auxiliary 
coordinate -- but I'm not sure that that is adequate. Applications and users 
are entitled to believe that a variable which has 

Re: [CF-metadata] Missing data bins in histograms

2019-05-14 Thread Hollis, Dan
Dear Martin/Jonathan/Jim,

I appreciate that this discussion is focussed on histograms, however I wonder 
if there is a wider issue here i.e. how should one record the number of missing 
values for any extensive quantity?

For example, we use number_of_days_with_air_temperature_below_threshold to 
store counts of days of air frost (computed from station observations of daily 
minimum temperature). The threshold is specified using a scalar coordinate 
variable called 'air_temperature' with a value of 0.0. The counts of air frost 
are for periods of months, seasons or years and, inevitably, the values for 
some periods for some stations are based on incomplete data. Is there a 
recommended method for recording the number of missing observations for each 
data point (apologies if I've missed this in the conventions)? If so then maybe 
the same approach could be used for histograms too. If not then my feeling is 
that whatever solution you propose should be applicable to all extensive 
quantities (i.e. all quantities that can be derived from a set of constituent 
observations). Having a special 'bin' might work for histogram data but would 
not work for other variables so I think a different approach is required.

My feeling is that the number of missing values is sort of like metadata i.e. 
it's telling you something about the quality of the data itself. Would an 
ancillary variable suit this purpose?

Regards,

Dan


-Original Message-
From: CF-metadata  On Behalf Of Martin Juckes 
- UKRI STFC
Sent: Tuesday, 14 May 2019 10:03
To: Gregory, Jonathan ; cf-metadata@cgd.ucar.edu
Subject: Re: [CF-metadata] Missing data bins in histograms

Dear Jonathan,


I looked at "valid_range", and also "actual_range", but I believe that the 
definitions of either of these would have to be changed to accommodate this 
usage, and we would run into the problem that Jim raised in connection with my 
earlier suggestion of using "missing_value": such changes can break assumptions 
made by existing software. Data outside the "valid_range" may well be 
automatically rejected by a user application before the data gets to any CF 
aware libraries. For instance, python netCDF4 at version 1.3.0 and 1.3.1 
automatically removes data outside the valid_range, giving the user a masked 
array.  There is some discussion of this here: 
https://github.com/Unidata/netcdf4-python/issues/748.


It is possible to 
circumvent this behaviour by changing the auto-masking setting in python 
netCDF4, and the NUG does suggest using values outside the "valid_range" as 
flags. NUG also suggests using the missing_value attribute to list such flag 
values ... but Jim has pointed out that such an approach is likely to cause 
problems with many applications. This is a complex area because the meaning of 
"missing_value" in NUG has evolved. Up until CF 1.5 it appears that a 
"missing_value" meant, unambiguously, missing data.  The current CF appears to 
have changed this in line with NUG so that different usages are now permissible, but 
I still agree with Jim's objection. We can't, I'm sure, at this stage, follow 
an approach which depends on users being able to control the auto-masking 
settings (it is a simple call to the "set_auto_mask" method if you are using 
the python netCDF4 library directly ... but may not be available to users who 
are working with applications built on the library).


I wanted to use a new standard name for the height bins because of the fact that 
the value in the first bin, which I have set to -., is not a height. This 
data point needs to have a valid floating point value to conform to the rules 
for a coordinate array, but, unlike the rest of the array, it should not be 
interpreted as height. This is signalled by the presence of an auxiliary 
coordinate -- but I'm not sure that that is adequate. Applications and users 
are entitled to believe that a variable which has standard name "height" really 
refers to height, without having to check all the auxiliary coordinates to see 
if there is something there which modifies the meaning of the variable. The 
standard name "height_bins" would signal that they must look in the auxiliary 
coordinate.
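
To make the concern concrete, here is a minimal sketch of how a reader might 
separate the special bin from the real height bins once it knows to consult the 
auxiliary coordinate. All names and values are illustrative assumptions: the 
flag values, the placeholder -1.0 for the first bin, the counts and the helper 
split_counts are not taken from any convention or dataset.

```python
from dataclasses import dataclass

# Hypothetical flag values; none have been agreed for this proposal.
BIN_IS_HEIGHT = 0
BIN_IS_FAILED_RETRIEVAL = 1


@dataclass
class Histogram:
    bin_values: list       # coordinate values; a placeholder where the flag is set
    bin_status_flag: list  # one flag per bin, from the auxiliary coordinate
    counts: list           # histogram counts, one per bin


def split_counts(hist):
    """Return ({height: count}, failed_count), using the flags to classify bins."""
    by_height = {}
    failed = 0
    for value, flag, count in zip(hist.bin_values, hist.bin_status_flag, hist.counts):
        if flag == BIN_IS_FAILED_RETRIEVAL:
            failed += count  # the coordinate value here is not a height
        else:
            by_height[value] = count
    return by_height, failed


hist = Histogram(
    bin_values=[-1.0, 500.0, 1500.0, 3000.0],  # -1.0 is an illustrative placeholder
    bin_status_flag=[BIN_IS_FAILED_RETRIEVAL, BIN_IS_HEIGHT, BIN_IS_HEIGHT, BIN_IS_HEIGHT],
    counts=[12, 40, 25, 8],
)
by_height, failed = split_counts(hist)
```

A reader that skips the flags would treat the placeholder as a height of -1.0 
carrying 12 counts, which is the misreading a distinct standard name is meant 
to guard against.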


Do you agree with the necessity and appropriateness of the new name of 
"bin_status_flag" which I have suggested for the auxiliary coordinate?


regards,

Martin


From: CF-metadata  on behalf of Jonathan 
Gregory 
Sent: 13 May 2019 18:00
To: cf-metadata@cgd.ucar.edu
Subject: Re: [CF-metadata] Missing data bins in histograms

Dear Martin

I agree that an alternative which would not require a change to the convention 
is to attach a string-valued aux coord variable. However, the flags are much 
more economical and seem natural, as you say.

As I said in my last email, I feel that it's better to keep the standard name 
as it is, despite the presence of a special value in it which isn't really a