# Title
Lossy Compression by Coordinate Sampling

# Moderator
@user

# Moderator Status Review [last updated: YYYY-MM-DD]
Brief comment on current status, update periodically

# Requirement Summary
The spatiotemporal, spectral, and thematic resolutions of Earth science data are increasing rapidly. This presents a challenge for all types of Earth science data, whether they are derived from models, in-situ observations, or remote sensing observations. In particular, when coordinate information varies with time, the domain definition can be many times larger than the (potentially already very large) data which it describes. This is often the case for remote sensing products, such as swath measurements from a polar orbiting satellite (e.g. slide 4 in https://cfconventions.org/Meetings/2020-workshop/Subsampled-coordinates-in-CF-netCDF.pdf).

Such datasets are often prohibitively expensive to store, and so some form of compression is required. However, native compression, such as that available in the HDF5 library, does not generally provide enough of a saving, due to the nature of the values being compressed (e.g. few missing or repeated values).

An alternative form of compression-by-convention amounts to storing only a small subsample of the coordinate values, alongside an interpolation algorithm that describes how the subsample can be used to regenerate the original, unsampled set of coordinates. This form of compression has been shown to out-perform native compression by "orders of magnitude" (e.g. slide 6 in https://cfconventions.org/Meetings/2020-workshop/Subsampled-coordinates-in-CF-netCDF.pdf).

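To make the idea concrete, the following is a minimal sketch of storing a subsample and regenerating the full coordinate array from it. It assumes a one-dimensional, synthetic coordinate and plain linear interpolation between the stored points; it is not one of the standardized algorithms from the proposal, and the function names `subsample` and `reconstruct` are hypothetical.

```python
import math


def subsample(coords, step):
    """Keep every `step`-th value (and always the last one) as stored points."""
    indices = list(range(0, len(coords), step))
    if indices[-1] != len(coords) - 1:
        indices.append(len(coords) - 1)
    return indices, [coords[i] for i in indices]


def reconstruct(indices, stored, n):
    """Rebuild n values by linear interpolation between stored points.

    Assumption: plain linear interpolation, chosen only for illustration.
    """
    out = [stored[0]] * n
    for k in range(len(indices) - 1):
        i0, i1 = indices[k], indices[k + 1]
        v0, v1 = stored[k], stored[k + 1]
        for i in range(i0, i1 + 1):
            t = (i - i0) / (i1 - i0)
            out[i] = (1.0 - t) * v0 + t * v1
    return out


# Synthetic, smoothly varying "latitude" along a track (illustrative data only).
n = 1025
lat = [60.0 * math.sin(math.pi * i / (n - 1) - math.pi / 2) for i in range(n)]

indices, stored = subsample(lat, 16)
rebuilt = reconstruct(indices, stored, n)
max_err = max(abs(a - b) for a, b in zip(lat, rebuilt))

print(len(stored), "values stored instead of", n)
print("maximum absolute reconstruction error (degrees):", max_err)
```

The stored points are reproduced exactly, while the values in between carry a small interpolation error that grows with the subsampling step. This is why the scheme is lossy and why the data creator has to choose the step and the interpolation method together.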
Various implementations following this broad methodology are currently in use (see https://github.com/cf-convention/discuss/issues/37#issuecomment-608459133 for examples). However, the steps that are required to reconstitute the full resolution coordinates are not necessarily well defined within a dataset. This proposal offers a standardized approach covering the complete end-to-end process, including a detailed description of the required steps. At the same time, it is a framework in which new methods can be added or existing methods can be extended.

Unlike [compression by gathering](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#compression-by-gathering), this form of compression is lossy due to rounding and approximation errors in the required interpolation calculations. However, the loss in accuracy is a function of the degree to which the coordinates are subsampled and of the choice of interpolation algorithm (of which there are configurable standardized and non-standardized options), and so may be determined by the data creator to be within acceptable limits. For example, in one application with cell sizes of approximately 750 metres by 750 metres, interpolation of a stored subsample comprising every 16th value in each dimension was able to recreate the original coordinate values to a mean accuracy of ~1 metre. (Details of this test are available.)

Whilst remote sensing applications are the motivating concern for this proposal, the approach presented has been designed to be fully general, and so can be applied to structured coordinates describing any domain, such as one describing model outputs.

# Technical Proposal Summary
See PR #326 for details.
In summary:

* The approach and encoding are fully described in the new section 8.3 "Lossy Compression by Coordinate Sampling" to [Chapter 8: Reduction of Dataset Size](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#_reduction_of_dataset_size).
* A new appendix J describes the standardized interpolation algorithms, and includes guidance for data creators.
* [Appendix A](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#attribute-appendix) has been updated for a new data and domain variable attribute.
* [The conformance document](https://cfconventions.org/Data/cf-documents/requirements-recommendations/conformance-1.8.html) has new checks for all of the new content.
* The new "interpolation variable" has been included in the [Terminology](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#terminology) in [Chapter 1](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#_introduction).
* The list of examples in [toc-extra.adoc](https://github.com/cf-convention/cf-conventions/blob/master/toc-extra.adoc) has been updated for the new examples in section 8.3.

# Benefits
Anyone who has prohibitively large domain descriptions for which absolute accuracy of cell locations is not an issue may benefit.

# Status Quo
The storage of large, structured domain descriptions is either prohibitively expensive or is handled in non-standardized ways.

# Associated pull request
PR #326

# Detailed Proposal
PR #326

# Authors
This proposal has been put together by (in alphabetical order):

* Aleksandar Jelenak
* Anders Meier Soerensen
* Daniel Lee
* David Hassell
* Lucile Gaultier
* Sylvain Herlédan
* Thomas Lavergne