Re: [CF-metadata] [cf-convention/cf-conventions] Lossy Compression by Coordinate Sampling (#327)

JonathanGregory Fri, 11 Jun 2021 01:50:18 -0700

Dear all

I've studied the text of proposed changes to Sect 8, as someone not at all 
involved in writing it or using these kinds of technique. (It's easier to read 
the files in [Daniel's 
repo](https://github.com/erget/cf-conventions/blob/lossy-compression-through-coordinate-sampling/ch08.adoc) 
than [the pull 
request](https://github.com/cf-convention/cf-conventions/pull/326/files?short_path=ebcafde#diff-ebcafde998cd56873e594e76a10c8541235a4fed3d4664a9c7733805bff39a4c) 
in order to see the diagrams in place.) I think it all makes sense. It's 
well-designed and consistent with the rest of CF. Thanks for working it out so 
thoughtfully and carefully. The diagrams are very good as well.


I have not yet reviewed Appendix J or the conformance document. I'm going to be 
on leave next week, so I thought I'd contribute just this part before going.

Best wishes

Jonathan

There is one point where I have a suggestion for changing the content of the 
proposal, although probably you've already discussed this possibility. If I 
understand correctly, you must always have both the `tie_point_dimensions` and 
`tie_point_indices` attributes of the interpolation variable, and they must 
refer to the same tie point dimensions. Therefore I think a simpler design, 
easier for both the data-writer and data-reader to use, would combine these two 
attributes into one attribute, whose contents would be 
"*interpolation_dimension*`:` *tie_point_interpolation_dimension* 
*tie_point_index_variable* [*interpolation_zone_dimension*] 
[*interpolation_dimension*`:` ...]".
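
To make the combined form concrete, here is a minimal sketch of how a reader might parse such an attribute. It is an illustration only: the function name `parse_tie_points` and the dimension and variable names in the example string are hypothetical, and it assumes the colon is attached to the interpolation dimension name and that words are separated by whitespace.

```python
def parse_tie_points(attr):
    """Parse 'dim: tp_dim index_var [zone_dim] dim: ...' into a list of entries.

    A list is returned rather than a dict keyed by dimension, because the same
    interpolation dimension may be mapped more than once.
    """
    tokens = attr.split()
    entries = []
    i = 0
    while i < len(tokens):
        if not tokens[i].endswith(":"):
            raise ValueError(f"expected 'dimension:' but found {tokens[i]!r}")
        interpolation_dimension = tokens[i][:-1]
        i += 1
        # Collect the words that belong to this interpolation dimension.
        words = []
        while i < len(tokens) and not tokens[i].endswith(":"):
            words.append(tokens[i])
            i += 1
        if len(words) not in (2, 3):
            raise ValueError(
                f"{interpolation_dimension!r} needs 2 or 3 words, got {len(words)}"
            )
        entries.append({
            "interpolation_dimension": interpolation_dimension,
            "tie_point_interpolation_dimension": words[0],
            "tie_point_index_variable": words[1],
            # The interpolation zone dimension is optional.
            "interpolation_zone_dimension": words[2] if len(words) == 3 else None,
        })
    return entries


# Hypothetical attribute value, for illustration only.
example = "track: tp_track track_indices zone_track scan: tp_scan scan_indices"
for entry in parse_tie_points(example):
    print(entry)
```

With a single attribute, a missing or mismatched tie point dimension shows up as a parse error in one place, rather than as an inconsistency between two attributes.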

Also, I have some suggestions for naming:

* If you adopt my suggestion for a single attribute to replace 
`tie_point_dimensions` and `tie_point_indices`, an obvious name for it would be 
`tie_points`. You've used that name for the attribute of the data variable. 
However, I would suggest that the attribute of the data variable could equally 
well be called `interpolation`, since it names the interpolation variable, and 
signals that interpolation is to be used.

* Your terminology has "tie point interpolation dimension" and "interpolation 
dimension", but the former is not a special case of the latter. That could be 
confusing, in the same way that (unfortunately) in CF terminology an auxiliary 
coordinate variable is not a special kind of coordinate variable. I suggest you 
rename "tie point interpolation dimension" as e.g. "tie point reduced 
dimension" to avoid this misunderstanding.

* A similar possible confusion is that a tie point index variable is not a 
special kind of tie point variable. To avoid this confusion and add clarity, I 
suggest you could rename "tie point variable" as "tie point coordinate 
variable".

* The terms "interpolation zone" and "interpolation area" are unhelpful because 
it's not obvious from the words which one is bigger, so it's hard to remember. 
If you stick with "zone" for the small one, for area it would be better to use 
something which is more obviously much bigger, such as "province" or "realm"! 
Or perhaps you could use "division" or "department", since the defining 
characteristic is the discontinuity.

In the first paragraph of Sect 8 we distinguish three methods of reduction of 
dataset size. I would suggest minor clarifications:

> There are three methods for reducing dataset size: packing, lossless 
> compression, and lossy compression. By packing we mean altering the data in a 
> way that reduces its precision **(but has no other effect on accuracy)**. By 
> lossless compression we mean techniques that store the data more efficiently 
> and result in no **loss of precision or accuracy**. By lossy compression we 
> mean techniques that store the data more efficiently **and retain its 
> precision** but result in some loss in accuracy.

Then I think we could start a new paragraph with "Lossless compression only 
works in certain circumstances ...". By the way, isn't it the case that HDF 
supports per-variable gzipping? That wasn't available in the old netCDF data 
format for which this section was first written, so it's not mentioned, but 
perhaps it should be now.
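
As a rough illustration of the distinction between packing and lossless compression, the sketch below uses only the Python standard library rather than netCDF or HDF. The sample values are made up; the names `scale_factor` and `add_offset` follow the CF packing attributes, and `zlib` stands in for whatever lossless compressor the file format provides.

```python
import struct
import zlib

# Made-up sample data: 1000 temperatures in kelvin.
values = [273.15 + 0.01 * i for i in range(1000)]

# Packing: store each value as an unsigned 16-bit integer.
add_offset = min(values)
scale_factor = (max(values) - min(values)) / 65535
packed = [round((v - add_offset) / scale_factor) for v in values]
unpacked = [p * scale_factor + add_offset for p in packed]

# Precision is reduced: the round-trip error is bounded by half a step.
max_packing_error = max(abs(u - v) for u, v in zip(unpacked, values))
print("scale_factor:", scale_factor)
print("max packing error:", max_packing_error)

# Lossless compression: the 64-bit floats come back bit for bit.
raw = struct.pack(f"<{len(values)}d", *values)
compressed = zlib.compress(raw)
restored = list(struct.unpack(f"<{len(values)}d", zlib.decompress(compressed)))
print("lossless round trip exact:", restored == values)
print("bytes before and after:", len(raw), len(compressed))
```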

There are a few points where I found the text of Sect 8.3 possibly unclear or 
difficult to follow:

* "This form of compression may also be used on a domain variable with the same 
effect." I think this is an unclear addition. If I understand you correctly, 
instead of this final sentence you could begin the paragraph with "For some 
applications the coordinates of a data variable or a domain variable can 
require considerably more storage than the data in its domain."

* Tie Point Dimensions Attribute. If you adopt my suggestion above, this 
subsection would change its name to "Tie points attribute". It would be good to 
begin the section by saying what the attribute is for. As it stands, it plunges 
straight into details. The second sentence in particular, about interpolation 
zones, bewildered me - I didn't know what it was talking about.

* I follow this sentence: "For instance, interpolation dimension dimension1 
could be mapped to two different tie point interpolation dimensions with 
dimension1: tp_dimension1 dimension1: tp_dimension2." But I don't understand 
the next sentence: "This is necessary when different tie point variables for a 
particular interpolation dimension do not contain the same number of tie 
points, and therefore define different numbers of interpolation zones, as is 
the case in Multiple interpolation variables with interpolation parameter 
attributes." The situation described does not occur in the example quoted, I 
think. I wonder if it should say, "This occurs when data variables that share 
an interpolation dimension and interpolation variable have different tie points 
for that dimension."

* Instead of "A tie point variable must span at most one of the tie point 
interpolation dimensions associated with a given interpolation dimension." I 
would add a sentence to the first para of "Interpolation and non-interpolation 
dimension", which I would rewrite as follows:

> For each interpolation variable identified in the tie_points attribute, all 
> the associated tie point variables must share the same set of one or more 
> dimensions. Each of the dimensions of a tie point variable must be either a 
> dimension of the data variable, or a dimension which is to be interpolated 
> to a dimension of the data variable. A tie point variable must not have more 
> than one dimension corresponding to any given dimension of the data variable, 
> and may have fewer dimensions than the data variable. Dimensions of the tie 
> point variable which are interpolated are called tie point reduced 
> dimensions, and the corresponding data variable dimensions are called 
> interpolation dimensions, while those for which no interpolation is required, 
> being the same in the data variable and the tie point variable, are called 
> non-interpolation dimensions. The size of a tie point reduced dimension must 
> be less than or equal to the size of the corresponding interpolation 
> dimension.
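
The rules in the paragraph above could be checked mechanically. The following is only a sketch under stated assumptions: the function `check_tie_point_dimensions`, its arguments, and the dimension names in the example are all hypothetical, and the mapping from tie point reduced dimensions to interpolation dimensions is assumed to have been read already from the interpolation variable.

```python
def check_tie_point_dimensions(data_dims, tie_point_dims, reduced_to_interp, sizes):
    """Return a list of rule violations (empty if the dimensions are consistent).

    data_dims:         dimension names of the data variable
    tie_point_dims:    dimension names of one tie point variable
    reduced_to_interp: maps tie point reduced dimension -> interpolation dimension
    sizes:             maps dimension name -> size
    """
    problems = []
    targets = []
    for dim in tie_point_dims:
        if dim in reduced_to_interp:
            # A reduced dimension is interpolated to a data variable dimension.
            target = reduced_to_interp[dim]
            if sizes[dim] > sizes[target]:
                problems.append(f"{dim} is larger than {target}")
        else:
            # A non-interpolation dimension is shared unchanged.
            target = dim
        if target not in data_dims:
            problems.append(f"{dim} does not correspond to a data variable dimension")
        targets.append(target)
    # No data variable dimension may be matched by more than one dimension.
    for target in sorted(set(targets)):
        if targets.count(target) > 1:
            problems.append(f"more than one dimension corresponds to {target}")
    return problems


# Hypothetical example: 'scan' is interpolated from 'tp_scan', 'band' is shared.
sizes = {"track": 100, "scan": 512, "band": 3, "tp_scan": 33}
print(check_tie_point_dimensions(
    data_dims=("track", "scan", "band"),
    tie_point_dims=("tp_scan", "band"),
    reduced_to_interp={"tp_scan": "scan"},
    sizes=sizes,
))
```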

* In one place, you say "For each interpolation dimension, the number of 
interpolation zones is equal to the number of tie points minus the number of 
interpolation areas," and in another place, "An interpolation zone must span at 
least two points of each of its corresponding interpolation dimensions." It 
seems to me that "at least" is wrong - it should be "exactly two".

* "The dimensions of an interpolation parameter variable must be a subset of 
zero or more **of** the ...".

* I suggest a rewriting of the part about the dimensions of an interpolation 
parameter variable, for clarity, if I've understood it correctly, as follows:

> Where an interpolation zone dimension is provided, the variable provides a 
> single value along that dimension for each interpolation zone, assumed to be 
> defined at the centre of the interpolation zone.

> Where a tie point reduced dimension is provided, the variable provides a 
> value for each tie point along that dimension. The value applies to the two 
> interpolation zones on either side of the tie point, and is assumed to be 
> defined at the interpolation zone boundary (figure 3).

> In both cases, the implementation of the interpolation method should assume 
> that an interpolation parameter variable applies equally to all interpolation 
> zones along any interpolation dimension which it does not span.

* For "The bounds of a tie point must be the same as the bounds of the 
corresponding target grid cells," I would suggest, "The bounds of a tie point 
must be the same as the bounds of the target grid cells whose coordinates are 
specified as the tie point."

* I don't understand this sentence: "In this case, though, the tie point index 
variables are the identifying target domain cells to which the bounds apply, 
rather than bounds values themselves."  A tie point index variable could not 
possibly contain bounds values.

* In Example 8.5, you need only one (or maybe two) data variables since they're 
all the same in structure.
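
Returning to the counting relation quoted earlier (the number of interpolation zones equals the number of tie points minus the number of interpolation areas), a small sketch may make the "exactly two" reading concrete. It assumes that each zone along a dimension is bounded by exactly two adjacent tie points within one area; the function name `count_interpolation_zones` is hypothetical.

```python
def count_interpolation_zones(tie_points_per_area):
    """Count zones along one dimension, given the tie point count of each area."""
    # Each area with n tie points holds n - 1 zones, one per adjacent pair.
    zones_by_pairs = sum(n - 1 for n in tie_points_per_area)
    # The quoted relation: total tie points minus the number of areas.
    zones_by_formula = sum(tie_points_per_area) - len(tie_points_per_area)
    assert zones_by_pairs == zones_by_formula
    return zones_by_pairs


print(count_interpolation_zones([5]))     # one area, 5 tie points -> 4
print(count_interpolation_zones([3, 3]))  # two areas, 6 tie points -> 4
```

The two counts agree only if every zone is bounded by exactly two tie points, which is why "at least two" seems inconsistent with the quoted relation under this reading.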

