wgtmac commented on code in PR #240:
URL: https://github.com/apache/parquet-format/pull/240#discussion_r1597471889
##########
src/main/thrift/parquet.thrift:
##########
@@ -373,6 +376,69 @@ struct JsonType {
struct BsonType {
}
+/**
+ * A geometry can be any of the following subtypes.
+ * The list of geospatial subtypes is taken from the OGC (Open Geospatial Consortium)
+ * SFA (Simple Feature Access) Part 1- Common Architecture.
+ */
+enum GeometrySubType {
+ POINT = 0;
+ LINESTRING = 1;
+ POLYGON = 2;
+ MULTIPOINT = 3;
+ MULTILINESTRING = 4;
+ MULTIPOLYGON = 5;
+ GEOMETRY_COLLECTION = 6;
+}
+
+/**
+ * Interpretation for edges, i.e. whether the edge between points
+ * represent a straight cartesian line or the shortest line on the sphere
+ */
+enum Edges {
+ PLANAR = 0;
+ // SPHERICAL = 1; // not supported yet
+}
+
+/**
+ * Well-Known Binary. This is a well-known and popular binary representation regulated
+ * by the Open Geospatial Consortium (OGC).
+ */
+struct WKB {}
+/**
+ * Encoding for geospatial data.
+ */
+union GeospatialEncoding {
+ 1: WKB WKB
+}
+
+/**
+ * Geometry logical type annotation
+ *
+ * Allowed for physical types: BINARY (added in 2.11.0)
+ */
+struct GeometryType {
+ /**
+ * The subtype of the geometry.
+ * If set, all values in the column must be of the same subtype.
+ * If not set, the column may contain values of any subtype.
+ */
+ 1: optional GeometrySubType subtype;
+ /**
+ * The dimension of the geometry.
+ * For now only 2D geometry is supported and the value must be 2 if set.
+ */
+ 2: optional byte dimension;
Review Comment:
Perhaps we can define a different `ColumnOrder` for 3D? Is there a good
zone-map candidate to use for 3D?
##########
src/main/thrift/parquet.thrift:
##########
@@ -373,6 +376,69 @@ struct JsonType {
struct BsonType {
}
+/**
+ * A geometry can be any of the following subtypes.
+ * The list of geospatial subtypes is taken from the OGC (Open Geospatial Consortium)
+ * SFA (Simple Feature Access) Part 1- Common Architecture.
+ */
+enum GeometrySubType {
+ POINT = 0;
+ LINESTRING = 1;
+ POLYGON = 2;
+ MULTIPOINT = 3;
+ MULTILINESTRING = 4;
+ MULTIPOLYGON = 5;
+ GEOMETRY_COLLECTION = 6;
+}
+
+/**
+ * Interpretation for edges, i.e. whether the edge between points
+ * represent a straight cartesian line or the shortest line on the sphere
+ */
+enum Edges {
+ PLANAR = 0;
+ // SPHERICAL = 1; // not supported yet
+}
+
+/**
+ * Well-Known Binary. This is a well-known and popular binary representation regulated
+ * by the Open Geospatial Consortium (OGC).
+ */
+struct WKB {}
+/**
+ * Encoding for geospatial data.
+ */
+union GeospatialEncoding {
+ 1: WKB WKB
+}
+
+/**
+ * Geometry logical type annotation
+ *
+ * Allowed for physical types: BINARY (added in 2.11.0)
Review Comment:
This is something that we need to discuss and explore further. What I have in
mind is that we can define some fixed complex types and add the `GEOMETRY`
logical type to the root of each complex type.
##########
src/main/thrift/parquet.thrift:
##########
@@ -373,6 +376,69 @@ struct JsonType {
struct BsonType {
}
+/**
+ * A geometry can be any of the following subtypes.
+ * The list of geospatial subtypes is taken from the OGC (Open Geospatial Consortium)
+ * SFA (Simple Feature Access) Part 1- Common Architecture.
+ */
+enum GeometrySubType {
+ POINT = 0;
+ LINESTRING = 1;
+ POLYGON = 2;
+ MULTIPOINT = 3;
+ MULTILINESTRING = 4;
+ MULTIPOLYGON = 5;
+ GEOMETRY_COLLECTION = 6;
+}
+
+/**
+ * Interpretation for edges, i.e. whether the edge between points
+ * represent a straight cartesian line or the shortest line on the sphere
+ */
+enum Edges {
+ PLANAR = 0;
+ // SPHERICAL = 1; // not supported yet
+}
+
+/**
+ * Well-Known Binary. This is a well-known and popular binary representation regulated
+ * by the Open Geospatial Consortium (OGC).
+ */
+struct WKB {}
+/**
+ * Encoding for geospatial data.
+ */
+union GeospatialEncoding {
+ 1: WKB WKB
Review Comment:
I just followed the other `union`s in this file. It would be much cleaner to
use an enum if no geospatial encoding requires extra parameters. Using a
`union` here gives better extensibility if a future encoding carries
parameters.
##########
src/main/thrift/parquet.thrift:
##########
@@ -373,6 +376,69 @@ struct JsonType {
struct BsonType {
}
+/**
+ * A geometry can be any of the following subtypes.
+ * The list of geospatial subtypes is taken from the OGC (Open Geospatial Consortium)
+ * SFA (Simple Feature Access) Part 1- Common Architecture.
+ */
+enum GeometrySubType {
Review Comment:
Yes, this is just informative. I'm open to keeping or deleting it.
##########
src/main/thrift/parquet.thrift:
##########
@@ -270,8 +270,11 @@ struct Statistics {
* may set min_value="B", max_value="C". Such more compact values must still be
* valid values within the column's logical type.
*
- * Values are encoded using PLAIN encoding, except that variable-length byte
- * arrays do not include a length prefix.
+ * Values are encoded using PLAIN encoding, except that:
+ * 1) variable-length byte arrays do not include a length prefix.
+ * 2) geometry logical type with BoundingBoxOrder uses max_value/min_value pair
Review Comment:
> trying to wrap my head around how writer implementations in WKB case can
> get min_value/max_value
We have two options here:
1. If the input data is a well-defined geometry object, collecting the
bounding box is easy, but the interface of the Parquet library becomes more
complicated.
2. If the Parquet writer only accepts WKB-encoded binary data, the writer
must deserialize each value to get its coordinates, which degrades
performance.
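To make the cost of option 2 concrete, here is a minimal Python sketch of what a writer would have to do per value: walk the WKB bytes and track the coordinate extremes. It is an illustration only, not part of the PR or of any Parquet library. The function name `wkb_bbox` is hypothetical, and the sketch assumes plain 2D OGC WKB (geometry type codes 1 to 7, no Z/M or SRID extensions).

```python
import math
import struct


def wkb_bbox(wkb: bytes):
    """Return (xmin, ymin, xmax, ymax) for a 2D WKB value, or None if empty.

    Hypothetical helper: assumes standard OGC WKB type codes 1..7 only.
    """
    xs, ys = [], []

    def read_points(buf, pos, order, count):
        # Each 2D point is two 8-byte IEEE 754 doubles.
        for _ in range(count):
            x, y = struct.unpack_from(order + "dd", buf, pos)
            if not (math.isnan(x) or math.isnan(y)):  # empty points are NaN
                xs.append(x)
                ys.append(y)
            pos += 16
        return pos

    def read_uint32(buf, pos, order):
        return struct.unpack_from(order + "I", buf, pos)[0], pos + 4

    def read_geometry(buf, pos):
        # Header: 1 byte for byte order (1 = little endian), then uint32 type.
        order = "<" if buf[pos] == 1 else ">"
        gtype, pos = read_uint32(buf, pos + 1, order)
        if gtype == 1:  # Point
            return read_points(buf, pos, order, 1)
        if gtype == 2:  # LineString
            count, pos = read_uint32(buf, pos, order)
            return read_points(buf, pos, order, count)
        if gtype == 3:  # Polygon: a list of rings
            rings, pos = read_uint32(buf, pos, order)
            for _ in range(rings):
                count, pos = read_uint32(buf, pos, order)
                pos = read_points(buf, pos, order, count)
            return pos
        if gtype in (4, 5, 6, 7):  # Multi* and GeometryCollection
            parts, pos = read_uint32(buf, pos, order)
            for _ in range(parts):
                pos = read_geometry(buf, pos)  # each part has its own header
            return pos
        raise ValueError(f"unsupported WKB geometry type: {gtype}")

    read_geometry(wkb, 0)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))
```

Every coordinate of every value has to be visited, which is the deserialization overhead mentioned in option 2. With option 1 the caller would already hold the coordinates and could pass the bounds in directly.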
##########
src/main/thrift/parquet.thrift:
##########
@@ -373,6 +376,69 @@ struct JsonType {
struct BsonType {
}
+/**
+ * A geometry can be any of the following subtypes.
+ * The list of geospatial subtypes is taken from the OGC (Open Geospatial Consortium)
+ * SFA (Simple Feature Access) Part 1- Common Architecture.
+ */
+enum GeometrySubType {
+ POINT = 0;
+ LINESTRING = 1;
+ POLYGON = 2;
+ MULTIPOINT = 3;
+ MULTILINESTRING = 4;
+ MULTIPOLYGON = 5;
+ GEOMETRY_COLLECTION = 6;
+}
+
+/**
+ * Interpretation for edges, i.e. whether the edge between points
+ * represent a straight cartesian line or the shortest line on the sphere
+ */
+enum Edges {
+ PLANAR = 0;
+ // SPHERICAL = 1; // not supported yet
+}
+
+/**
+ * Well-Known Binary. This is a well-known and popular binary representation regulated
+ * by the Open Geospatial Consortium (OGC).
+ */
+struct WKB {}
+/**
+ * Encoding for geospatial data.
+ */
+union GeospatialEncoding {
+ 1: WKB WKB
+}
+
+/**
+ * Geometry logical type annotation
+ *
+ * Allowed for physical types: BINARY (added in 2.11.0)
+ */
+struct GeometryType {
+ /**
+ * The subtype of the geometry.
+ * If set, all values in the column must be of the same subtype.
+ * If not set, the column may contain values of any subtype.
+ */
+ 1: optional GeometrySubType subtype;
+ /**
+ * The dimension of the geometry.
+ * For now only 2D geometry is supported and the value must be 2 if set.
+ */
+ 2: optional byte dimension;
+ /**
+ * Coordinate Reference System, i.e. mapping of how coordinates refer to
+ * precise locations on earth.
+ * For now only OGC:CRS84 is supported.
+ */
+ 3: optional string crs;
+ 4: required Edges edges;
Review Comment:
Let me take a look.
##########
src/main/thrift/parquet.thrift:
##########
@@ -270,8 +270,11 @@ struct Statistics {
* may set min_value="B", max_value="C". Such more compact values must still be
* valid values within the column's logical type.
*
- * Values are encoded using PLAIN encoding, except that variable-length byte
- * arrays do not include a length prefix.
+ * Values are encoded using PLAIN encoding, except that:
+ * 1) variable-length byte arrays do not include a length prefix.
+ * 2) geometry logical type with BoundingBoxOrder uses max_value/min_value pair
Review Comment:
You're right. As you can see, `max_value` and `min_value` here are of
`binary` type, which is the serialized form and cannot be consumed directly.
AFAIK, the
[C++](https://github.com/apache/arrow/blob/1e3772cac5f45edb6ada3d20140b77cc86208346/cpp/src/parquet/statistics.h#L286-L290)
and
[Java](https://github.com/apache/parquet-mr/blob/c241170d9bc2cd8415b04e06ecea40ed3d80f64d/parquet-column/src/main/java/org/apache/parquet/column/statistics/DoubleStatistics.java#L133-L139)
Parquet implementations provide functions to access the deserialized
values from the `Statistics` class. We can also define an easy-to-use bounding
box class and expose it from the `Statistics` implementation, but that is
outside the scope of the spec.
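As a rough illustration of what "accessing the deserialized value" means for an existing type, the Python sketch below decodes a PLAIN-encoded DOUBLE statistic, which is stored as 8 little-endian IEEE 754 bytes. The function name `decode_double_stat` is hypothetical and does not mirror the C++ or Java API. The byte layout of a bounding box inside `min_value`/`max_value` is deliberately not assumed here, since it depends on how the spec ends up defining it.

```python
import struct


def decode_double_stat(raw: bytes) -> float:
    """Decode a PLAIN-encoded DOUBLE min/max statistic.

    Hypothetical helper: PLAIN encoding stores a DOUBLE as 8 bytes,
    little-endian IEEE 754.
    """
    if len(raw) != 8:
        raise ValueError(f"expected 8 bytes for DOUBLE, got {len(raw)}")
    return struct.unpack("<d", raw)[0]


# Usage: the raw bytes would come from Statistics.min_value / max_value.
raw_min = struct.pack("<d", -122.5)
raw_max = struct.pack("<d", 47.25)
print(decode_double_stat(raw_min), decode_double_stat(raw_max))
```

A bounding-box accessor on a `Statistics` implementation could follow the same pattern, hiding the serialized form behind a typed object.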
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]