# Title

Lossy Compression by Coordinate Sampling

# Moderator
@user

# Moderator Status Review [last updated: YYYY-MM-DD]

Brief comment on current status, update periodically

# Requirement Summary

The spatiotemporal, spectral, and thematic resolutions of Earth science data are 
increasing rapidly. This presents a challenge for all types of Earth science 
data, whether they are derived from models, in-situ measurements, or remote 
sensing observations.

In particular, when coordinate information varies with time, the domain 
definition can be many times larger than the (potentially already very large) 
data which it describes. This is often the case for remote sensing products, 
such as swath measurements from a polar-orbiting satellite (e.g. slide 4 in 
https://cfconventions.org/Meetings/2020-workshop/Subsampled-coordinates-in-CF-netCDF.pdf).

Such datasets are often prohibitively expensive to store, and so some form of 
compression is required. However, native compression, such as is available in 
the HDF5 library, does not generally provide enough of a saving, due to the 
nature of the values being compressed (e.g. few missing or repeated values).

An alternative form of compression-by-convention amounts to storing only a 
small subsample of the coordinate values, alongside an interpolation algorithm 
that describes how the subsample can be used to generate the original, 
unsampled set of coordinates. This form of compression has been shown to 
out-perform native compression by "orders of magnitude" (e.g. slide 6 in 
https://cfconventions.org/Meetings/2020-workshop/Subsampled-coordinates-in-CF-netCDF.pdf).
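As a rough illustration of the idea, the following minimal sketch keeps every 16th value of a one-dimensional coordinate and rebuilds the rest by plain linear interpolation. It is not one of the standardized algorithms from the proposal; the function names (`subsample`, `reconstruct`) and the synthetic latitude track are assumptions made for this example only.

```python
def subsample(values, step):
    """Keep every `step`-th value, always including the last one."""
    indices = list(range(0, len(values), step))
    if indices[-1] != len(values) - 1:
        indices.append(len(values) - 1)
    return indices, [values[i] for i in indices]


def reconstruct(indices, stored, n):
    """Rebuild n coordinate values by linear interpolation between stored values."""
    out = [stored[0]] * n
    segments = zip(zip(indices, stored), zip(indices[1:], stored[1:]))
    for (i0, v0), (i1, v1) in segments:
        for i in range(i0, i1 + 1):
            t = (i - i0) / (i1 - i0)
            out[i] = v0 + t * (v1 - v0)
    return out


# Hypothetical, smoothly varying latitude track in degrees (65 points).
lat = [10.0 + 0.05 * i + 1e-4 * i * i for i in range(65)]

idx, stored = subsample(lat, 16)          # only 5 of the 65 values are kept
rebuilt = reconstruct(idx, stored, len(lat))
max_err = max(abs(a - b) for a, b in zip(lat, rebuilt))
print(f"stored {len(stored)} of {len(lat)} values, max error {max_err:.4f} degrees")
```

The reconstruction is exact only where the coordinate varies linearly between stored values; any curvature shows up as interpolation error, which is why the choice of interpolation algorithm matters.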

Various implementations following this broad methodology are currently in use 
(see 
https://github.com/cf-convention/discuss/issues/37#issuecomment-608459133 
for examples); however, the steps that are required to reconstitute the 
full-resolution coordinates are not necessarily well defined within a dataset.

This proposal offers a standardized approach covering the complete end-to-end 
process, including a detailed description of the required steps. At the same 
time, it provides a framework in which new methods can be added or existing 
methods extended.

Unlike [compression by gathering](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#compression-by-gathering), 
this form of compression is lossy due to rounding and approximation errors 
in the required interpolation calculations. However, the loss in accuracy is a 
function of the degree to which the coordinates are subsampled, and the choice 
of interpolation algorithm (of which there are configurable standardized and 
non-standardized options), and so may be determined by the data creator to be 
within acceptable limits. For example, in one application with cell sizes of 
approximately 750 metres by 750 metres, interpolation of a stored subsample 
comprising every 16th value in each dimension was able to recreate the original 
coordinate values to a mean accuracy of ~1 metre. (Details of this test are 
available.)
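The trade-off between the degree of subsampling and the reconstruction accuracy can be explored with a small experiment. The sketch below uses a synthetic curved track and plain linear interpolation, both of which are assumptions for illustration; it does not reproduce the 750 metre test mentioned above, and `linear_reconstruction_error` is a hypothetical helper name.

```python
import math


def linear_reconstruction_error(values, step):
    """Max absolute error when keeping every `step`-th value and linearly
    interpolating the rest. Assumes (len(values) - 1) is a multiple of step."""
    worst = 0.0
    for start in range(0, len(values) - 1, step):
        v0, v1 = values[start], values[start + step]
        for k in range(step + 1):
            approx = v0 + (v1 - v0) * k / step
            worst = max(worst, abs(values[start + k] - approx))
    return worst


# Hypothetical curved coordinate track: 129 samples of a quarter sine wave.
track = [math.sin(i / 128 * math.pi / 2) for i in range(129)]

steps = (4, 8, 16, 32)
errors = {step: linear_reconstruction_error(track, step) for step in steps}
stored = {step: (len(track) - 1) // step + 1 for step in steps}

for step in steps:
    print(f"every {step}th value: {stored[step]} stored, max error {errors[step]:.2e}")
```

For this track the error grows as fewer values are stored, so a data creator could run a comparison of this kind on their own coordinates to pick the coarsest subsampling that stays within their accuracy limit.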

Whilst remote sensing applications are the motivating concern for this 
proposal, the approach presented has been designed to be fully general, and so 
can be applied to structured coordinates describing any domain, such as one 
describing model outputs.

# Technical Proposal Summary

See PR #326 for details. In summary:

The approach and encoding are fully described in the new section 8.3 "Lossy 
Compression by Coordinate Sampling" of [Chapter 8: Reduction of Dataset Size](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#_reduction_of_dataset_size).

A new appendix J describes the standardized interpolation algorithms, and 
includes guidance for data creators.

[Appendix A](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#attribute-appendix) 
has been updated for a new data and domain variable attribute.

[The conformance document](https://cfconventions.org/Data/cf-documents/requirements-recommendations/conformance-1.8.html) 
has new checks for all of the new content.

The new "interpolation variable" has been included in the 
[Terminology](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#terminology) 
in [Chapter 1](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#_introduction).

The list of examples in 
[toc-extra.adoc](https://github.com/cf-convention/cf-conventions/blob/master/toc-extra.adoc) 
has been updated for the new examples in section 8.3.

# Benefits

Anyone with prohibitively large domain descriptions, for which absolute 
accuracy of cell locations is not required, may benefit.

# Status Quo

The storage of large, structured domain descriptions is either prohibitively 
expensive, or is handled in non-standardized ways.

# Associated pull request

PR #326

# Detailed Proposal

PR #326

# Authors

This proposal has been put together by (in alphabetical order):

Aleksandar Jelenak
Anders Meier Soerensen
Daniel Lee
David Hassell
Lucile Gaultier
Sylvain Herlédan
Thomas Lavergne 