paleolimbot commented on code in PR #494:
URL: https://github.com/apache/parquet-format/pull/494#discussion_r2063972056
##########
Geospatial.md:
##########
@@ -94,6 +94,39 @@ Bounding box is defined as the thrift struct below in the
representation of
min/max value pair of coordinates from each axis. Note that X and Y Values are
always present. Z and M are omitted for 2D geospatial instances.
+Writers should follow the guidelines below when calculating bounding boxes in
+the presence of edge cases.
+
+* `null` instance: Skip it and continue processing the remaining
+ geospatial instances. Do not produce a bounding box if all instances are
null.
+* Non-`null` instance with [invalid geospatial
values](#invalid-geospatial-values):
+ * X and Y: Skip any invalid X or Y value and continue processing the
+ remaining X or Y values. Do not produce a bounding box if all X or all Y
+ values are invalid.
+
+ * Z: Skip any invalid Z value and continue processing the remaining Z values.
+ Omit Z from the bounding box if all Z values are invalid.
+
+ * M: Skip any invalid M value and continue processing the remaining M values.
+ Omit M from the bounding box if all M values are invalid.
+
+Readers should follow the guidelines below when examining bounding boxes.
+Parquet does not permit `null` or `NaN` values in bounding boxes, whether at
Review Comment:
What would a null value in a bounding box be? (Maybe clarify what you mean
here or remove it?)
##########
Geospatial.md:
##########
@@ -94,6 +94,39 @@ Bounding box is defined as the thrift struct below in the
representation of
min/max value pair of coordinates from each axis. Note that X and Y Values are
always present. Z and M are omitted for 2D geospatial instances.
+Writers should follow the guidelines below when calculating bounding boxes in
+the presence of edge cases.
+
+* `null` instance: Skip it and continue processing the remaining
+ geospatial instances. Do not produce a bounding box if all instances are
null.
+* Non-`null` instance with [invalid geospatial
values](#invalid-geospatial-values):
+ * X and Y: Skip any invalid X or Y value and continue processing the
+ remaining X or Y values. Do not produce a bounding box if all X or all Y
+ values are invalid.
+
+ * Z: Skip any invalid Z value and continue processing the remaining Z values.
+ Omit Z from the bounding box if all Z values are invalid.
+
+ * M: Skip any invalid M value and continue processing the remaining M values.
+ Omit M from the bounding box if all M values are invalid.
+
+Readers should follow the guidelines below when examining bounding boxes.
+Parquet does not permit `null` or `NaN` values in bounding boxes, whether at
+the overall bounding box level or within individual coordinate fields.
+
+* No bounding box: No assumptions can be made about the presence or validity
+ of coordinate values. Readers may need to load all individual coordinate
+ values for validation.
+
+* A bounding box is present:
+ * X and Y: Both X and Y of the bounding box must be present.
Review Comment:
I think this one is important for implementations (although I do think you
should just say `NaN` instead of "invalid"). Perhaps a concrete way to phrase
this (unless I'm missing something) would be "If any of `xmin, ymin, xmax, or
ymax` is `NaN`, the bounding box is not reliable and should not be used".
##########
Geospatial.md:
##########
@@ -162,3 +195,21 @@ The axis order of the coordinates in WKB and bounding box
stored in Parquet
follows the de facto standard for axis order in WKB and is therefore always
(x, y) where x is easting or longitude and y is northing or latitude. This
ordering explicitly overrides the axis order as specified in the CRS.
+
+# Invalid geospatial values
+
+An invalid geospatial value refers to the coordinate values of a non-`null`
+geospatial instance that are encoded in a valid WKB format, but are not
+considered valid values under this specification. While different WKB
+readers may interpret such values differently, the resulting output should
+be treated as invalid.
+
+* `NaN`: Not a Number. For example, a `Point` with no X and Y values in WKB is
+ represented by a `Point` with each ordinate value set to an IEEE-754 quiet
+ NaN value (hex: `01 01 00 00 00 00 00 00 00 00 00 00 f8 7f 00 00 00 00 00 00
f8 7f`).
Review Comment:
```suggestion
represented by a `Point` with each ordinate value set to an IEEE-754
NaN value (e.g., hex: `01 01 00 00 00 00 00 00 00 00 00 00 f8 7f 00 00 00
00 00 00 f8 7f`).
NaN values in other geometry types are typically considered invalid
geometries by other
libraries.
```
(I don't think quiet NaN is specified anywhere official and I have seen
signaling NaNs exported once...and it might be worth mentioning that NaNs
elsewhere are fishy. I think you already mentioned the suggestion for how to
calculate the bounding box when a writer encounters them above)
##########
Geospatial.md:
##########
@@ -162,3 +195,21 @@ The axis order of the coordinates in WKB and bounding box
stored in Parquet
follows the de facto standard for axis order in WKB and is therefore always
(x, y) where x is easting or longitude and y is northing or latitude. This
ordering explicitly overrides the axis order as specified in the CRS.
+
+# Invalid geospatial values
+
+An invalid geospatial value refers to the coordinate values of a non-`null`
+geospatial instance that are encoded in a valid WKB format, but are not
+considered valid values under this specification. While different WKB
+readers may interpret such values differently, the resulting output should
+be treated as invalid.
+
Review Comment:
Do you want to add a bullet for invalid WKB? Like if somebody puts an empty
string, what happens? (Currently, Java will error, C++ just omits its
statistics. I think either of these are fine, although I chose the 'omit' route
because Parquet doesn't validate JSON or UUIDs either.)
##########
Geospatial.md:
##########
@@ -162,3 +195,19 @@ The axis order of the coordinates in WKB and bounding box
stored in Parquet
follows the de facto standard for axis order in WKB and is therefore always
(x, y) where x is easting or longitude and y is northing or latitude. This
ordering explicitly overrides the axis order as specified in the CRS.
+
+# Invalid geospatial values
+
+An invalid geospatial value refers to the coordinate values of a non-`null`
+geospatial instance that are encoded in a valid WKB format, but are not
Review Comment:
It's a good point that if you're not going to go in to invalid WKB formats,
mentioning a valid one is confusing. I think this section is better labeled as
"Special geospatial values" since many of the things you mention in it are
perfectly valid (and common) cases.
##########
Geospatial.md:
##########
@@ -162,3 +195,19 @@ The axis order of the coordinates in WKB and bounding box
stored in Parquet
follows the de facto standard for axis order in WKB and is therefore always
(x, y) where x is easting or longitude and y is northing or latitude. This
ordering explicitly overrides the axis order as specified in the CRS.
+
+# Invalid geospatial values
+
+An invalid geospatial value refers to the coordinate values of a non-`null`
+geospatial instance that are encoded in a valid WKB format, but are not
+considered valid values under this specification. While different WKB
+readers may interpret such values differently, the resulting output should
+be treated as invalid.
+
+* `NaN`: Not a Number. For example, `POINT EMPTY` in WKB is represented by a
+ `Point` with each ordinate value set to an IEEE-754 quiet NaN value.
+* `Empty geometries`: Geometries explicitly marked as empty in WKB using
Review Comment:
Should these be included in a section labeled "invalid geometries"? They're
perfectly valid WKB and are likely to show up in reasonable usage...perhaps
re-label the section to be "Special/Corner-case geometry values to be aware
of"? Or separate the empties somehow?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]