jiayuasu commented on code in PR #240:
URL: https://github.com/apache/parquet-format/pull/240#discussion_r1696385308
##########
src/main/thrift/parquet.thrift:
##########
@@ -373,6 +453,51 @@ struct JsonType {
struct BsonType {
}
+/**
+ * Physical type and encoding for the geometry type.
+ */
+enum GeometryEncoding {
+ /**
+ * Allowed for physical type: BYTE_ARRAY.
+ *
+ * Well-known binary (WKB) representations of geometries. It supports 2D or
+ * 3D geometries of the standard geometry types (Point, LineString, Polygon,
+ * MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection). This
+ * is the preferred option for maximum portability.
+ *
+ * This encoding enables GeometryStatistics to be set in the column chunk
+ * and page index.
+ */
+ WKB = 0;
+
+ // TODO: add native encoding from GeoParquet/GeoArrow
+}
+
+/**
+ * Geometry logical type annotation (added in 2.11.0)
+ */
+struct GeometryType {
+ /**
+ * Physical type and encoding for the geometry type. Please refer to the
+ * definition of GeometryEncoding for more detail.
+ */
+ 1: required GeometryEncoding encoding;
+ /**
+ * Edges of polygon.
+ */
+ 2: required Edges edges;
+ /**
+ * Coordinate Reference System, i.e. mapping of how coordinates refer to
+ * precise locations on earth, e.g. OGC:CRS84
+ */
+ 3: optional string crs;
Review Comment:
@paleolimbot @wgtmac @Kontinuation @zhangfengcdt Please see the latest
comment from @desruisseaux on the Iceberg Geometry PR
(https://github.com/apache/iceberg/issues/10260). I also discussed this issue
with the GeoParquet community today.
In short, some people like projjson given its popularity and completeness,
some people don't like it because it is still incomplete and there is another
on-going effort at OGC CRS working group called CRSJSON to replace it, some
other people like SRID (comments from Snowflake).
We'd better define what CRS encoding is used in this optional CRS string by
using an additional enum field to avoid future confusion.
Example:
```
5. optional enum CRSEncoding {
WKT2 = 0;
SRID = 1; // Format: AUTHORITY:CODE
PROJJSON = 2;
// Future work when CRSJSON is completed: add CRSJSON = 3
}
```
Some members from the GeoParquet community think allowing multiple encoding
of CRS introduce additional overhead to the implementer. I would argue that
this is not a problem because:
1. The reader / writer does not need to support all CRS Encodings. It is
optional anyway.
2. Even if we use projjson, a large number of engines in the Java world
(Hive, Trino, Presto, Sedona/Spark, Sedona/Flink) cannot comprehend it because
no Java library for it.
3. The CRS encoding is purely a description field. If an engine does not
understand it, just carry it along the way.
In addition, the next GeoParquet community sync is August 12. Please join us
for the discussion of this topic: Zoom meeting:
https://github.com/opengeospatial/geoparquet/discussions/240
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]