paleolimbot commented on code in PR #240:
URL: https://github.com/apache/parquet-format/pull/240#discussion_r1615969004


##########
src/main/thrift/parquet.thrift:
##########
@@ -237,6 +237,38 @@ struct SizeStatistics {
    3: optional list<i64> definition_level_histogram;
 }
 
+/**
+ * Bounding box of geometries in the representation of min/max value pair of
+ * coordinates from each axis. Values of Z and M are omitted for 2D geometries.
+ */
+struct BoundingBox {
+  1: optional double x_min;
+  2: optional double x_max;
+  3: optional double y_min;
+  4: optional double y_max;

Review Comment:
   If it makes sense in thrift, I think these should be required (i.e., if you 
are going to put in a bounding box, you must include x and y).
   
   nit: I have never seen an underscore in the definition of `x_min` (and 
friends) in the open source geospatial world. I have seen `minx` and 
`xmin`...my personal choice would be `xmin`.



##########
src/main/thrift/parquet.thrift:
##########
@@ -373,6 +408,74 @@ struct JsonType {
 struct BsonType {
 }
 
+/**
+ * Phyiscal type and encoding for the geometry type.
+ */
+enum GeometryEncoding {
+  /**
+   * Allowed for phyiscal type: BYTE_ARRAY.
+   *
+   * Well-known binary (WKB) representations of geometries. It supports 2D or
+   * 3D geometries of the standard geometry types (Point, LineString, Polygon,
+   * MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection). This
+   * is the preferred option for maximum portability.
+   *
+   * This encoding enables GeometryStatistics to be set in the column chunk
+   * and page index.
+   */
+  WKB = 0;
+
+  /**
+   * Encodings from POINT to MULTIPOLYGON below are specialized for single
+   * geometry type and inspired by GeoArrow (https://geoarrow.org/format.html)
+   * native encodings. It uses the separated (struct) representation of
+   * coordinates for single-geometry type encodings because this encoding
+   * results in useful column statistics when row groups and/or files contain
+   * related features.
+   *
+   * WARNING: GeometryStatistics cannot be enabled for these encodings because
+   * only leaf columns can have column statistics and page index.
+   *
+   * The actual coordinates of the geometries MUST be stored as native numbers,
+   * i.e. using the DOUBLE type in a (repeated) group of fields (exact
+   * repetition depending on the geometry type).
+   *
+   * For the POINT encoding, this results in a struct of two fields for x and y
+   * coordinates (in case of 2D geometries):
+   * optional group geometry {
+   *   required double x;
+   *   required double y;
+   * }
+   *
+   * For more detail, please refer to link below:
+   * 
https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#encoding
+   */
+  POINT = 1;
+  LINESTRING = 2;
+  POLYGON = 3;
+  MULTIPOINT = 4;
+  MULTILINESTRING = 5;
+  MULTIPOLYGON = 6;
+}
+
+/**
+ * Geometry logical type annotation (added in 2.11.0)
+ */
+struct GeometryType {
+  /**
+   * Phyiscal type and encoding for the geometry type. Please refer to the
+   * definition of GeometryEncoding for more detail.
+   */
+  1: required GeometryEncoding encoding;
+  /**
+   * Additional informative metadata.
+   * It can be used by GeoParquet to offload some of the column metadata.
+   */
+  2: optional string metadata;
+  /** File-level statistics for geometries */
+  3: optional GeometryStatistics statistics;

Review Comment:
   I don't personally think that this belongs with the type...if Parquet would 
like to add file-level statistics, it should probably do it the same for all 
types rather than just for geometry. GeoParquet put those statistics at the 
file level because there was no other place to put them when we first created 
the standard.



##########
src/main/thrift/parquet.thrift:
##########
@@ -373,6 +408,74 @@ struct JsonType {
 struct BsonType {
 }
 
+/**
+ * Phyiscal type and encoding for the geometry type.
+ */
+enum GeometryEncoding {
+  /**
+   * Allowed for phyiscal type: BYTE_ARRAY.
+   *
+   * Well-known binary (WKB) representations of geometries. It supports 2D or
+   * 3D geometries of the standard geometry types (Point, LineString, Polygon,
+   * MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection). This
+   * is the preferred option for maximum portability.
+   *
+   * This encoding enables GeometryStatistics to be set in the column chunk
+   * and page index.
+   */
+  WKB = 0;
+
+  /**
+   * Encodings from POINT to MULTIPOLYGON below are specialized for single
+   * geometry type and inspired by GeoArrow (https://geoarrow.org/format.html)
+   * native encodings. It uses the separated (struct) representation of
+   * coordinates for single-geometry type encodings because this encoding
+   * results in useful column statistics when row groups and/or files contain
+   * related features.
+   *
+   * WARNING: GeometryStatistics cannot be enabled for these encodings because
+   * only leaf columns can have column statistics and page index.
+   *
+   * The actual coordinates of the geometries MUST be stored as native numbers,
+   * i.e. using the DOUBLE type in a (repeated) group of fields (exact
+   * repetition depending on the geometry type).
+   *
+   * For the POINT encoding, this results in a struct of two fields for x and y
+   * coordinates (in case of 2D geometries):
+   * optional group geometry {
+   *   required double x;
+   *   required double y;
+   * }
+   *
+   * For more detail, please refer to link below:
+   * 
https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#encoding
+   */
+  POINT = 1;
+  LINESTRING = 2;
+  POLYGON = 3;
+  MULTIPOINT = 4;
+  MULTILINESTRING = 5;
+  MULTIPOLYGON = 6;
+}
+
+/**
+ * Geometry logical type annotation (added in 2.11.0)
+ */
+struct GeometryType {
+  /**
+   * Phyiscal type and encoding for the geometry type. Please refer to the
+   * definition of GeometryEncoding for more detail.
+   */
+  1: required GeometryEncoding encoding;
+  /**
+   * Additional informative metadata.
+   * It can be used by GeoParquet to offload some of the column metadata.
+   */
+  2: optional string metadata;

Review Comment:
   One absolutely critical reason to do this is that one cannot calculate a 
bounding box in the normal way if a geometry has spherical edges. Similarly, 
for geographic coordinates, if an edge crosses the antimeridian, the bbox will 
be invalid. If we include those details as part of the logical type, geo-naive 
writers can safely fill in column statistics with a fairly basic WKB parser.



##########
src/main/thrift/parquet.thrift:
##########
@@ -237,6 +237,38 @@ struct SizeStatistics {
    3: optional list<i64> definition_level_histogram;
 }
 
+/**
+ * Bounding box of geometries in the representation of min/max value pair of
+ * coordinates from each axis. Values of Z and M are omitted for 2D geometries.
+ */
+struct BoundingBox {
+  1: optional double x_min;
+  2: optional double x_max;
+  3: optional double y_min;
+  4: optional double y_max;
+  5: optional double z_min;
+  6: optional double z_max;
+  7: optional double m_min;
+  8: optional double m_max;
+}
+
+/** Statistics specific to GEOMETRY logical type */
+struct GeometryStatistics {
+  /** Bounding box of geometries */
+  1: optional BoundingBox bbox;
+  /** Covering of geometries as a list of Google S2 cell ids */
+  2: list<i64> s2_cell_ids;
+  /** Covering of geometries as a list of Uber H3 indices */
+  3: list<i64> h3_indices;
+  /**
+   * The geometry types of all geometries, or an empty array if they are not
+   * known. It follows the same rule of `geometry_types` column metadata of
+   * GeoParquet. Accepted geometry types are: "Point", "LineString", "Polygon",
+   * "MultiPoint", "MultiLineString", "MultiPolygon", "GeometryCollection".

Review Comment:
   I think this should also include dimension (e.g., `Point Z`), which would 
align with how we did this in the GeoParquet standard ( 
https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#geometry_types
 ).



##########
src/main/thrift/parquet.thrift:
##########
@@ -237,6 +237,38 @@ struct SizeStatistics {
    3: optional list<i64> definition_level_histogram;
 }
 
+/**
+ * Bounding box of geometries in the representation of min/max value pair of
+ * coordinates from each axis. Values of Z and M are omitted for 2D geometries.
+ */
+struct BoundingBox {
+  1: optional double x_min;
+  2: optional double x_max;
+  3: optional double y_min;
+  4: optional double y_max;
+  5: optional double z_min;
+  6: optional double z_max;
+  7: optional double m_min;
+  8: optional double m_max;
+}
+
+/** Statistics specific to GEOMETRY logical type */
+struct GeometryStatistics {
+  /** Bounding box of geometries */
+  1: optional BoundingBox bbox;
+  /** Covering of geometries as a list of Google S2 cell ids */
+  2: list<i64> s2_cell_ids;
+  /** Covering of geometries as a list of Uber H3 indices */
+  3: list<i64> h3_indices;

Review Comment:
   S2 and H3 are maybe similar to Snappy compression here (there are widely 
used open source implementations each supported by one company but no 
"standard" in the ISO or OGC sense).
   
   That said, I think that a more future-proof/vendor-agnostic thing to put 
here would be the bytes of a WKB-encoded polygon (which can encode an s2 or h3 
covering, albeit less compactly). It would also need an `edges` field that is 
independent of the edges of the logical type (s2 and h3 coverings both have 
spherical edges but the data may not).



##########
src/main/thrift/parquet.thrift:
##########
@@ -373,6 +408,74 @@ struct JsonType {
 struct BsonType {
 }
 
+/**
+ * Phyiscal type and encoding for the geometry type.
+ */
+enum GeometryEncoding {
+  /**
+   * Allowed for phyiscal type: BYTE_ARRAY.
+   *
+   * Well-known binary (WKB) representations of geometries. It supports 2D or
+   * 3D geometries of the standard geometry types (Point, LineString, Polygon,
+   * MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection). This
+   * is the preferred option for maximum portability.
+   *
+   * This encoding enables GeometryStatistics to be set in the column chunk
+   * and page index.
+   */
+  WKB = 0;
+
+  /**
+   * Encodings from POINT to MULTIPOLYGON below are specialized for single
+   * geometry type and inspired by GeoArrow (https://geoarrow.org/format.html)
+   * native encodings. It uses the separated (struct) representation of
+   * coordinates for single-geometry type encodings because this encoding
+   * results in useful column statistics when row groups and/or files contain
+   * related features.
+   *
+   * WARNING: GeometryStatistics cannot be enabled for these encodings because
+   * only leaf columns can have column statistics and page index.
+   *
+   * The actual coordinates of the geometries MUST be stored as native numbers,
+   * i.e. using the DOUBLE type in a (repeated) group of fields (exact
+   * repetition depending on the geometry type).
+   *
+   * For the POINT encoding, this results in a struct of two fields for x and y
+   * coordinates (in case of 2D geometries):
+   * optional group geometry {
+   *   required double x;
+   *   required double y;
+   * }
+   *
+   * For more detail, please refer to link below:
+   * 
https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#encoding
+   */
+  POINT = 1;
+  LINESTRING = 2;
+  POLYGON = 3;
+  MULTIPOINT = 4;
+  MULTILINESTRING = 5;
+  MULTIPOLYGON = 6;
+}
+
+/**
+ * Geometry logical type annotation (added in 2.11.0)
+ */
+struct GeometryType {
+  /**
+   * Phyiscal type and encoding for the geometry type. Please refer to the
+   * definition of GeometryEncoding for more detail.
+   */
+  1: required GeometryEncoding encoding;
+  /**
+   * Additional informative metadata.
+   * It can be used by GeoParquet to offload some of the column metadata.
+   */
+  2: optional string metadata;

Review Comment:
   I know I was supportive of doing this initially; however, I think this would 
better accomplish its goal of interoperability to explicitly spell out `crs` 
and `edges` in thrift language (as you did in a previous version of this PR). 
(As long are the Parquet community will be receptive to the open source 
geospatial community's feedback when current best practice changes, and so far 
it seems as though they are 🙂 )



##########
src/main/thrift/parquet.thrift:
##########
@@ -373,6 +408,74 @@ struct JsonType {
 struct BsonType {
 }
 
+/**
+ * Phyiscal type and encoding for the geometry type.
+ */
+enum GeometryEncoding {
+  /**
+   * Allowed for phyiscal type: BYTE_ARRAY.
+   *
+   * Well-known binary (WKB) representations of geometries. It supports 2D or
+   * 3D geometries of the standard geometry types (Point, LineString, Polygon,
+   * MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection). This
+   * is the preferred option for maximum portability.
+   *
+   * This encoding enables GeometryStatistics to be set in the column chunk
+   * and page index.
+   */
+  WKB = 0;
+
+  /**
+   * Encodings from POINT to MULTIPOLYGON below are specialized for single
+   * geometry type and inspired by GeoArrow (https://geoarrow.org/format.html)
+   * native encodings. It uses the separated (struct) representation of
+   * coordinates for single-geometry type encodings because this encoding
+   * results in useful column statistics when row groups and/or files contain
+   * related features.
+   *
+   * WARNING: GeometryStatistics cannot be enabled for these encodings because
+   * only leaf columns can have column statistics and page index.

Review Comment:
   In this case, the statistics for the leaf columns contain equivalent 
information to the bounding box (which might be worth mentioning here).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to