paleolimbot commented on code in PR #494:
URL: https://github.com/apache/parquet-format/pull/494#discussion_r2063972056


##########
Geospatial.md:
##########
@@ -94,6 +94,39 @@ Bounding box is defined as the thrift struct below in the 
representation of
 min/max value pair of coordinates from each axis. Note that X and Y Values are
 always present. Z and M are omitted for 2D geospatial instances.
 
+Writers should follow the guidelines below when calculating bounding boxes in
+the presence of edge cases.
+
+* `null` instance: Skip it and continue processing the remaining 
+  geospatial instances. Do not produce a bounding box if all instances are 
null.
+* Non-`null` instance with [invalid geospatial 
values](#invalid-geospatial-values):
+  * X and Y: Skip any invalid X or Y value and continue processing the 
+    remaining X or Y values. Do not produce a bounding box if all X or all Y 
+    values are invalid.
+
+  * Z: Skip any invalid Z value and continue processing the remaining Z values.
+    Omit Z from the bounding box if all Z values are invalid.
+
+  * M: Skip any invalid M value and continue processing the remaining M values.
+    Omit M from the bounding box if all M values are invalid.
+
+Readers should follow the guidelines below when examining bounding boxes. 
+Parquet does not permit `null` or `NaN` values in bounding boxes, whether at 

Review Comment:
   What would a null value in a bounding box be? (Maybe clarify what you mean 
here or remove it?)



##########
Geospatial.md:
##########
@@ -94,6 +94,39 @@ Bounding box is defined as the thrift struct below in the 
representation of
 min/max value pair of coordinates from each axis. Note that X and Y Values are
 always present. Z and M are omitted for 2D geospatial instances.
 
+Writers should follow the guidelines below when calculating bounding boxes in
+the presence of edge cases.
+
+* `null` instance: Skip it and continue processing the remaining 
+  geospatial instances. Do not produce a bounding box if all instances are 
null.
+* Non-`null` instance with [invalid geospatial 
values](#invalid-geospatial-values):
+  * X and Y: Skip any invalid X or Y value and continue processing the 
+    remaining X or Y values. Do not produce a bounding box if all X or all Y 
+    values are invalid.
+
+  * Z: Skip any invalid Z value and continue processing the remaining Z values.
+    Omit Z from the bounding box if all Z values are invalid.
+
+  * M: Skip any invalid M value and continue processing the remaining M values.
+    Omit M from the bounding box if all M values are invalid.
+
+Readers should follow the guidelines below when examining bounding boxes. 
+Parquet does not permit `null` or `NaN` values in bounding boxes, whether at 
+the overall bounding box level or within individual coordinate fields.
+
+* No bounding box: No assumptions can be made about the presence or validity 
+  of coordinate values. Readers may need to load all individual coordinate 
+  values for validation.
+
+* A bounding box is present:
+    * X and Y: Both X and Y of the bounding box must be present.

Review Comment:
   I think this one is important for implementations (although I do think you 
should just say `NaN` instead of "invalid"). Perhaps a concrete way to phrase 
this (unless I'm missing something) would be "If any of `xmin, ymin, xmax, or 
ymax` is `NaN`, the bounding box is not reliable and should not be used".



##########
Geospatial.md:
##########
@@ -162,3 +195,21 @@ The axis order of the coordinates in WKB and bounding box 
stored in Parquet
 follows the de facto standard for axis order in WKB and is therefore always
 (x, y) where x is easting or longitude and y is northing or latitude. This
 ordering explicitly overrides the axis order as specified in the CRS.
+
+# Invalid geospatial values
+
+An invalid geospatial value refers to the coordinate values of a non-`null` 
+geospatial instance that are encoded in a valid WKB format, but are not 
+considered valid values under this specification. While different WKB 
+readers may interpret such values differently, the resulting output should 
+be treated as invalid.
+
+* `NaN`: Not a Number. For example, a `Point` with no X and Y values in WKB is 
+  represented by a `Point` with each ordinate value set to an IEEE-754 quiet 
+  NaN value (hex: `01 01 00 00 00 00 00 00 00 00 00 00 f8 7f 00 00 00 00 00 00 
f8 7f`).

Review Comment:
   ```suggestion
     represented by a `Point` with each ordinate value set to an IEEE-754 
     NaN value (e.g., hex: `01 01 00 00 00 00 00 00 00 00 00 00 f8 7f 00 00 00 
00 00 00 f8 7f`).
     NaN values in other geometry types are typically considered invalid 
geometries by other
     libraries.
   ```
   
   (I don't think quiet NaN is specified anywhere official and I have seen 
signaling NaNs exported once...and it might be worth mentioning that NaNs 
elsewhere are fishy. I think you already mentioned the suggestion for how to 
calculate the bounding box when a writer encounters them above)



##########
Geospatial.md:
##########
@@ -162,3 +195,21 @@ The axis order of the coordinates in WKB and bounding box 
stored in Parquet
 follows the de facto standard for axis order in WKB and is therefore always
 (x, y) where x is easting or longitude and y is northing or latitude. This
 ordering explicitly overrides the axis order as specified in the CRS.
+
+# Invalid geospatial values
+
+An invalid geospatial value refers to the coordinate values of a non-`null` 
+geospatial instance that are encoded in a valid WKB format, but are not 
+considered valid values under this specification. While different WKB 
+readers may interpret such values differently, the resulting output should 
+be treated as invalid.
+

Review Comment:
   Do you want to add a bullet for invalid WKB? Like if somebody puts an empty 
string, what happens? (Currently, Java will error, C++ just omits its 
statistics. I think either of these are fine, although I chose the 'omit' route 
because Parquet doesn't validate JSON or UUIDs either.)



##########
Geospatial.md:
##########
@@ -162,3 +195,19 @@ The axis order of the coordinates in WKB and bounding box 
stored in Parquet
 follows the de facto standard for axis order in WKB and is therefore always
 (x, y) where x is easting or longitude and y is northing or latitude. This
 ordering explicitly overrides the axis order as specified in the CRS.
+
+# Invalid geospatial values
+
+An invalid geospatial value refers to the coordinate values of a non-`null` 
+geospatial instance that are encoded in a valid WKB format, but are not 

Review Comment:
   It's a good point that if you're not going to go in to invalid WKB formats, 
mentioning a valid one is confusing. I think this section is better labeled as 
"Special geospatial values" since many of the things you mention in it are 
perfectly valid (and common) cases.



##########
Geospatial.md:
##########
@@ -162,3 +195,19 @@ The axis order of the coordinates in WKB and bounding box 
stored in Parquet
 follows the de facto standard for axis order in WKB and is therefore always
 (x, y) where x is easting or longitude and y is northing or latitude. This
 ordering explicitly overrides the axis order as specified in the CRS.
+
+# Invalid geospatial values
+
+An invalid geospatial value refers to the coordinate values of a non-`null` 
+geospatial instance that are encoded in a valid WKB format, but are not 
+considered valid values under this specification. While different WKB 
+readers may interpret such values differently, the resulting output should 
+be treated as invalid.
+
+* `NaN`: Not a Number. For example, `POINT EMPTY` in WKB is represented by a 
+  `Point` with each ordinate value set to an IEEE-754 quiet NaN value.
+* `Empty geometries`: Geometries explicitly marked as empty in WKB using 

Review Comment:
   Should these be included in a section labeled "invalid geometries"? They're 
perfectly valid WKB and are likely to show up in reasonable usage...perhaps 
re-label the section to be "Special/Corner-case geometry values to be aware 
of"? Or separate the empties somehow?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to