Dear All,

The latest version (4.4.8) of NCO contains a Precision-Preserving Compression (PPC) feature that might benefit from wider discussion before its associated metadata are finalized. If you are interested in precision, compression, or just procrastination, please join a discussion on changes or improvements to the scheme I've devised.
More documentation on the PPC algorithms and their performance is at http://nco.sf.net/nco.html#ppc

However, I think any changes to CF would focus on definitions (of precision) and implementation. For data that are rounded (quantized), users want to know what that means, not necessarily how it was performed. The meaning of data precision, and thus what it means for data to be "rounded" or "quantized", could be clarified in CF with something like the text drafted below. These changes adequately represent, I think, an existing metadata annotation for precision used in nc3tonc4 by Jeff Whitaker, which NCO has adopted (called DSD below), as well as an annotation for a new method of quantization (called NSD below) introduced in NCO. You will see that it boils down to adding an attribute that indicates the type and degree of imposed precision. A possibility that I considered and discarded was to specify the absolute precision in units of the stored variable (rather than the number of significant digits). There are arguments both ways...

The suggested CF changes below are a minimal way of specifying how data have been quantized. A more general metadata framework for precision might include distinctions for the intrinsic precision of a measurement or model (in addition to precision lost to post-processing or rounding), notations helpful for propagating errors, and a way to specify precision lost to packing/unpacking. None of that is in the draft below, which simply extends CF to cover precision imposed by NSD and DSD quantization. For concreteness, two short code sketches of the packing and rounding arithmetic follow the draft text.

If you do or do not want CF to recommend attributes describing precision and/or lossy compression, please comment...

Best,
Charlie

Current CF:
http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/cf-conventions.html#packed-data

"Methods for reducing the total volume of data include both packing and compression. Packing reduces the data volume by reducing the precision of the stored numbers. It is implemented using the attributes add_offset and scale_factor which are defined in the NUG. Compression on the other hand loses no precision, but reduces the volume by not storing missing data. The attribute compress is defined for this purpose."

Proposed CF:

"Methods for reducing the total volume of data include packing, rounding, and compression. Packing reduces the data volume by reducing the range and precision of the stored numbers. It is implemented using the attributes add_offset and scale_factor which are defined in the NUG. Rounding preserves data values to a specified level of precision, with no required loss in range. It is implemented using bitmasking or other quantization techniques. Compression on the other hand loses no precision, but reduces the volume by not storing missing data. The attribute compress is defined for this purpose."

...

"Packing quantizes data from a floating point representation into an integer representation within a limited range that requires only one-half or one-quarter of the number of floating-point bytes. For values that occupy a limited range, typically about five orders of magnitude, packing yields an efficient tradeoff between precision and size because all bits are dedicated to precision, not to exponents. A limitation of packing is that unpacking data stored as integers into the linear range defined by scale_factor and add_offset rapidly loses precision outside of a narrow range of floating point values. Variables packed as NC_SHORT, for example, can represent only about 64000 discrete values in the range -32768*scale_factor+add_offset to 32767*scale_factor+add_offset. The precision of packed data equals the value of scale_factor, and scale_factor must be chosen to span the range of valid data, not to represent the intrinsic or desired precision of the values. Values that were packed and then unpacked have lost precision, although there is no standard way of recording this other than the history of the data processing.

[One solution to this would be to record the former scale_factor of unpacked data in a precision attribute, e.g., "maximum_precision". Any champions for this?]

Rounding allows per-variable specification of precision in terms of significant digits valid across the entire range of the floating point representation. The precision specification may take one of two forms, either the total number of significant digits (NSD) or the number of decimal significant digits (DSD), i.e., digits following (positive) or preceding (negative) the decimal point. The attributes "number_of_significant_digits" and "least_significant_digit" indicate that the variable has been rounded to the specified precision using the NSD or DSD definitions, respectively. The quantized values stored with these attributes are guaranteed to be within one-half of a unit increment in the value of the least significant digit. Consider, for example, a true value of 1776.0704. Approximations valid to a precision of NSD=2 (or DSD=-2) include 1800.0 and 1750.123, both of which are within 50 (one-half a unit increment in the hundreds digit) of the true value. Approximations valid to a precision of NSD=5 (or DSD=1) include 1776.1 and 1776.03, both of which are within 0.05 (one-half a unit increment in the tenths digit) of the true value."

8.2 ...
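Not part of the draft text, just an illustration: here is a minimal sketch of the NUG-style packing and unpacking arithmetic described above, assuming numpy and made-up data values, to show why the precision of packed data is set by scale_factor (i.e., by the data range) rather than by significant digits.

import numpy as np

# Made-up sample values; names and data are purely illustrative.
data = np.array([0.01, 2.5, 17.76, 1776.0704], dtype=np.float32)

# NUG-style packing: scale_factor and add_offset are chosen to span the
# data range, then values are quantized to 16-bit integers (NC_SHORT).
dmin, dmax = data.min(), data.max()
scale_factor = (dmax - dmin) / (2**16 - 2)
add_offset = (dmax + dmin) / 2
packed = np.round((data - add_offset) / scale_factor).astype(np.int16)

# Unpacking recovers each value only to within +/- scale_factor/2.
unpacked = packed * scale_factor + add_offset
print(abs(unpacked - data).max())  # bounded by scale_factor/2, ~0.0136 here

The maximum error is bounded by scale_factor/2, which for data spanning several orders of magnitude is negligible relative to the largest values but large relative to the smallest, which is the limitation the draft text describes.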
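And here is a rough sketch of the NSD and DSD rounding definitions, again assuming numpy and netCDF4-python. It uses decimal arithmetic purely to illustrate the definitions; an actual implementation (e.g., NCO's) may instead use bitmasking, as the draft text notes. The file and variable names in the annotation step are invented.

import numpy as np
from netCDF4 import Dataset

def round_nsd(data, nsd):
    # Round to a total number of significant digits (NSD).
    data = np.asarray(data, dtype=np.float64)
    exponent = np.floor(np.log10(np.abs(np.where(data == 0, 1.0, data))))
    lsd = 10.0 ** (exponent - (nsd - 1))  # value of the least significant digit
    return np.round(data / lsd) * lsd

def round_dsd(data, dsd):
    # Round to decimal significant digits (DSD): digits after (positive)
    # or before (negative) the decimal point.
    return np.round(np.asarray(data, dtype=np.float64), decimals=dsd)

print(round_nsd(1776.0704, 2))  # 1800.0, within 50 of the true value
print(round_dsd(1776.0704, 1))  # 1776.1, within 0.05 of the true value

# Round a variable and record the proposed NSD attribute
# (file and variable names are invented; assumes no missing values).
nc = Dataset("example.nc", "r+")
var = nc.variables["temperature"]
var[:] = round_nsd(var[:], 3)
var.number_of_significant_digits = 3
nc.close()

Both helpers round to nearest, so the stored values stay within one-half of a unit increment in the least significant digit, which is the guarantee the proposed attributes are meant to convey.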
--
Charlie Zender, Earth System Sci. & Computer Sci.
University of California, Irvine
949-891-2429 )'(

_______________________________________________
CF-metadata mailing list
[email protected]
http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata
