szehon-ho commented on code in PR #240:
URL: https://github.com/apache/parquet-format/pull/240#discussion_r1618165783
##########
src/main/thrift/parquet.thrift:
##########
@@ -373,6 +408,74 @@ struct JsonType {
struct BsonType {
}
+/**
+ * Physical type and encoding for the geometry type.
+ */
+enum GeometryEncoding {
+ /**
+ * Allowed for physical type: BYTE_ARRAY.
+ *
+ * Well-known binary (WKB) representations of geometries. It supports 2D or
+ * 3D geometries of the standard geometry types (Point, LineString, Polygon,
+ * MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection). This
+ * is the preferred option for maximum portability.
+ *
+ * This encoding enables GeometryStatistics to be set in the column chunk
+ * and page index.
+ */
+ WKB = 0;
+
+ /**
+ * Encodings from POINT to MULTIPOLYGON below are specialized for single
+ * geometry type and inspired by GeoArrow (https://geoarrow.org/format.html)
+ * native encodings. It uses the separated (struct) representation of
+ * coordinates for single-geometry type encodings because this encoding
+ * results in useful column statistics when row groups and/or files contain
+ * related features.
+ *
+ * WARNING: GeometryStatistics cannot be enabled for these encodings because
+ * only leaf columns can have column statistics and page index.
Review Comment:
I think WKB = 0 is pretty clear, but the GeoArrow ones are less clear
because they dictate that the entire type must be a single Geometry subtype
(i.e. all Points, or all Lines, etc.)? That can be limiting for users.
As you may have already seen, we were chatting with @jiayuasu about the Iceberg
proposal, and they had experimented with encodings like [GeoLake
native](https://wherobots.notion.site/EXT-Geometry-encoding-type-and-expression-in-Iceberg-193f84ff42b44d2db326dc43f753598f#89e7f7a23e28437985db0ecc724f6127),
which is a native encoding that reuses the same schema to represent all
Geometry subtypes.
As this may be a longer debate, maybe we can leave the non-WKB ones out of
this first pass?
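To illustrate why WKB is less limiting here: each WKB value carries its own geometry type in its header, so one BYTE_ARRAY column can mix Points, LineStrings, and so on. The sketch below is only an illustration and not part of the PR. It assumes ISO WKB type codes (1 to 7, with Z/M variants offset by multiples of 1000) and does not handle EWKB flag bits. The helper names `wkb_geometry_type`, `make_point`, and `make_linestring` are hypothetical.

```python
import struct

# ISO WKB base geometry type codes. Z, M and ZM variants add 1000, 2000
# and 3000 respectively, so `code % 1000` recovers the base type.
# Assumption: EWKB-style high-bit flags are not handled in this sketch.
WKB_TYPE_NAMES = {
    1: "Point",
    2: "LineString",
    3: "Polygon",
    4: "MultiPoint",
    5: "MultiLineString",
    6: "MultiPolygon",
    7: "GeometryCollection",
}


def wkb_geometry_type(wkb: bytes) -> str:
    """Return the base geometry type name read from a WKB header."""
    if len(wkb) < 5:
        raise ValueError("WKB value too short to contain a header")
    # First byte is the byte order: 1 = little endian, 0 = big endian.
    byte_order = "<" if wkb[0] == 1 else ">"
    (code,) = struct.unpack_from(byte_order + "I", wkb, 1)
    return WKB_TYPE_NAMES[code % 1000]


def make_point(x: float, y: float) -> bytes:
    """Hypothetical helper: encode a 2D Point as little-endian WKB."""
    return struct.pack("<BIdd", 1, 1, x, y)


def make_linestring(coords) -> bytes:
    """Hypothetical helper: encode a 2D LineString as little-endian WKB."""
    body = b"".join(struct.pack("<dd", x, y) for x, y in coords)
    return struct.pack("<BII", 1, 2, len(coords)) + body


# A single column of WKB values can hold mixed geometry subtypes.
column = [make_point(1.0, 2.0), make_linestring([(0.0, 0.0), (1.0, 1.0)])]
types_in_column = sorted({wkb_geometry_type(v) for v in column})
```

A single-subtype native encoding would need a separate column (or file) for each of these values, which is the limitation raised above.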
##########
src/main/thrift/parquet.thrift:
##########
@@ -237,6 +237,38 @@ struct SizeStatistics {
3: optional list<i64> definition_level_histogram;
}
+/**
+ * Bounding box of geometries in the representation of min/max value pair of
+ * coordinates from each axis. Values of Z and M are omitted for 2D geometries.
+ */
+struct BoundingBox {
+ 1: optional double x_min;
+ 2: optional double x_max;
+ 3: optional double y_min;
+ 4: optional double y_max;
+ 5: optional double z_min;
+ 6: optional double z_max;
+ 7: optional double m_min;
+ 8: optional double m_max;
+}
+
+/** Statistics specific to GEOMETRY logical type */
+struct GeometryStatistics {
+ /** Bounding box of geometries */
+ 1: optional BoundingBox bbox;
+ /** Covering of geometries as a list of Google S2 cell ids */
+ 2: list<i64> s2_cell_ids;
+ /** Covering of geometries as a list of Uber H3 indices */
+ 3: list<i64> h3_indices;
+ /**
+ * The geometry types of all geometries, or an empty array if they are not
+ * known. It follows the same rule of `geometry_types` column metadata of
+ * GeoParquet. Accepted geometry types are: "Point", "LineString", "Polygon",
+ * "MultiPoint", "MultiLineString", "MultiPolygon", "GeometryCollection".
Review Comment:
Hi, I am curious what the use case for this metadata is. Is it for some kind
of file/row-group-level filtering based on the geometry types that were saved?
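To make the question concrete, here is a minimal sketch of the kind of row-group filtering I have in mind. It is only a guess at the use case, not something the PR specifies. `RowGroupGeoStats` and `row_groups_to_scan` are hypothetical names standing in for whatever a reader would derive from `GeometryStatistics`. An empty type list is treated as "unknown", per the doc comment, so those row groups are never skipped.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class RowGroupGeoStats:
    """Hypothetical stand-in for the per-row-group geometry statistics."""
    row_group: int
    # Empty list means the geometry types are not known.
    geometry_types: List[str] = field(default_factory=list)


def row_groups_to_scan(stats: Iterable[RowGroupGeoStats],
                       wanted: Set[str]) -> List[int]:
    """Return the row groups that may contain any of the wanted types."""
    keep = []
    for s in stats:
        # Unknown types: we cannot prune, so the row group must be read.
        if not s.geometry_types:
            keep.append(s.row_group)
        # Known types: read only if at least one wanted type is present.
        elif wanted.intersection(s.geometry_types):
            keep.append(s.row_group)
    return keep


stats = [
    RowGroupGeoStats(0, ["Point"]),
    RowGroupGeoStats(1, ["Polygon", "MultiPolygon"]),
    RowGroupGeoStats(2, []),
]
# A query that only needs polygons could skip row group 0.
polygon_groups = row_groups_to_scan(stats, {"Polygon"})
```

If this is the intended use, it would help to say so in the doc comment.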