cholmes commented on code in PR #240:
URL: https://github.com/apache/parquet-format/pull/240#discussion_r1767487031


##########
src/main/thrift/parquet.thrift:
##########
@@ -237,6 +237,97 @@ struct SizeStatistics {
    3: optional list<i64> definition_level_histogram;
 }
 
+/**
+ * Interpretation for edges of GEOMETRY logical type, i.e. whether the edge
+ * between points represent a straight cartesian line or the shortest line on
+ * the sphere. It applies to all non-point geometry objects.
+ */
+enum Edges {
+  PLANAR = 0;
+  SPHERICAL = 1;
+}
+
+/**
+ * A custom binary-encoded polygon or multi-polygon to represent a covering of
+ * geometries. For example, it may be a bounding box or an envelope of 
geometries
+ * when a bounding box cannot be built (e.g. a geometry has spherical edges, 
or if
+ * an edge of geographic coordinates crosses the antimeridian). In addition, 
it can
+ * also be used to provide vendor-agnostic coverings like S2 or H3 grids.
+ */
+struct Covering {
+  /**
+   * A type of covering. Currently accepted values: "WKB".
+   */
+  1: required string kind;

Review Comment:
   Is 'kind' a typical thing in Parquet? If not I'd call this 'type' - kind 
sounds weird to me, but not sure if 'type' as a very specific meaning for 
parquet. This is totally just a stylistic thing, so feel free to ignore. But 
note in the description we say 'a type of covering', not a 'the kind of 
covering'. Kind seems 'fuzzier', that it's wkb or close to it, while type seems 
to indicate just one to me.



##########
src/main/thrift/parquet.thrift:
##########
@@ -373,6 +467,98 @@ struct JsonType {
 struct BsonType {
 }
 
+/**
+ * Physical type and encoding for the geometry type.
+ */
+enum GeometryEncoding {
+  /**
+   * Allowed for physical type: BYTE_ARRAY.
+   *
+   * Well-known binary (WKB) representations of geometries.
+   *
+   * To be clear, we follow the same rule of WKB and coordinate axis order from
+   * GeoParquet [1][2]. It is the ISO WKB supporting XY, XYZ, XYM, XYZM and the
+   * standard geometry types (Point, LineString, Polygon, MultiPoint,
+   * MultiLineString, MultiPolygon, and GeometryCollection).
+   *
+   * This is the preferred encoding for maximum portability. It also supports
+   * GeometryStatistics to be set in the column chunk and page index.
+   *
+   * [1] 
https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L92
+   * [2] 
https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L155
+   */
+  WKB = 0;
+}
+
+/**
+ * Geometry logical type annotation (added in 2.11.0)
+ */
+struct GeometryType {
+  /**
+   * Physical type and encoding for the geometry type.
+   * Please refer to the definition of GeometryEncoding for more detail.
+   */
+  1: required GeometryEncoding encoding;
+  /**
+   * Edges of geometry type.
+   * Please refer to the definition of Edges for more detail.
+   */
+  2: required Edges edges;
+  /**
+   * Coordinate Reference System, i.e. mapping of how coordinates refer to
+   * precise locations on earth. Writers are not required to set this field.
+   * Once crs is set, crs_encoding field below MUST be set together.
+   * For example, "OGC:CRS84" can be set in the form of PROJJSON as below:
+   * {
+   *     "$schema": "https://proj.org/schemas/v0.5/projjson.schema.json";,
+   *     "type": "GeographicCRS",
+   *     "name": "WGS 84 longitude-latitude",
+   *     "datum": {
+   *         "type": "GeodeticReferenceFrame",
+   *         "name": "World Geodetic System 1984",
+   *         "ellipsoid": {
+   *             "name": "WGS 84",
+   *             "semi_major_axis": 6378137,
+   *             "inverse_flattening": 298.257223563
+   *         }
+   *     },
+   *     "coordinate_system": {
+   *         "subtype": "ellipsoidal",
+   *         "axis": [
+   *         {
+   *             "name": "Geodetic longitude",
+   *             "abbreviation": "Lon",
+   *             "direction": "east",
+   *             "unit": "degree"
+   *         },
+   *         {
+   *             "name": "Geodetic latitude",
+   *             "abbreviation": "Lat",
+   *             "direction": "north",
+   *             "unit": "degree"
+   *         }
+   *         ]
+   *     },
+   *     "id": {
+   *         "authority": "OGC",
+   *         "code": "CRS84"
+   *     }
+   * }
+   */
+  3: optional string crs;
+  /**
+   * Encoding used in the above crs field. It MUST be set if crs field is set.
+   * Currently the only allowed value is "PROJJSON".
+   */
+  4: optional string crs_encoding;
+  /**
+   * Additional informative metadata.
+   * GeoParquet could offload its column metadata in a JSON-encoded UTF-8 
string:
+   * 
https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L46

Review Comment:
   +1 for list<KeyValue> as it sounds like that better matches how things work 
here. I believe used JSON because that was the way to specify in the user space 
that we have.



##########
src/main/thrift/parquet.thrift:
##########
@@ -237,6 +237,97 @@ struct SizeStatistics {
    3: optional list<i64> definition_level_histogram;
 }
 
+/**
+ * Interpretation for edges of GEOMETRY logical type, i.e. whether the edge
+ * between points represent a straight cartesian line or the shortest line on
+ * the sphere. It applies to all non-point geometry objects.
+ */
+enum Edges {

Review Comment:
   I'd say `EdgeInterpolation` is the better name than `EdgeKind`. +1 for 
keeping the attribute of the type the same as GeoParquet for consistency and 
simplicity, but makes sense to me that the enum name could be more descriptive.



##########
src/main/thrift/parquet.thrift:
##########
@@ -237,6 +237,97 @@ struct SizeStatistics {
    3: optional list<i64> definition_level_histogram;
 }
 
+/**
+ * Interpretation for edges of GEOMETRY logical type, i.e. whether the edge
+ * between points represent a straight cartesian line or the shortest line on
+ * the sphere. It applies to all non-point geometry objects.
+ */
+enum Edges {
+  PLANAR = 0;
+  SPHERICAL = 1;
+}
+
+/**
+ * A custom binary-encoded polygon or multi-polygon to represent a covering of
+ * geometries. For example, it may be a bounding box or an envelope of 
geometries
+ * when a bounding box cannot be built (e.g. a geometry has spherical edges, 
or if
+ * an edge of geographic coordinates crosses the antimeridian). In addition, 
it can
+ * also be used to provide vendor-agnostic coverings like S2 or H3 grids.
+ */
+struct Covering {
+  /**
+   * A type of covering. Currently accepted values: "WKB".
+   */
+  1: required string kind;
+  /**
+   * A payload specific to kind. Below are the supported values:
+   * - WKB: well-known binary of a POLYGON or MULTI-POLYGON that completely
+   *   covers the contents. This will be interpreted according to the same CRS
+   *   and edges defined by the logical type.
+   */
+  2: required binary value;
+}
+
+/**
+ * Bounding box of geometries in the representation of min/max value pair of
+ * coordinates from each axis. Values of Z and M are omitted for 2D geometries.
+ * Filter pushdown on geometries using this is only safe for planar spatial
+ * filters.
+ */
+struct BoundingBox {
+  1: required double xmin;
+  2: required double xmax;
+  3: required double ymin;
+  4: required double ymax;
+  5: optional double zmin;
+  6: optional double zmax;
+  7: optional double mmin;
+  8: optional double mmax;
+}
+
+/** Statistics specific to GEOMETRY logical type */
+struct GeometryStatistics {
+  /** A bounding box of geometries */
+  1: optional BoundingBox bbox;
+
+  /**
+   * A list of coverings of geometries.
+   * Note that It is allowed to have more than one covering of the same kind 
and
+   * implementation is free to use any of them. It is recommended to have at 
most
+   * one covering for each kind.
+   */
+  2: optional list<Covering> coverings;
+
+  /**
+   * The geometry types of all geometries, or an empty array if they are not
+   * known. This is borrowed from `geometry_types` column metadata of 
GeoParquet [1]
+   * except that values in the list are WKB (ISO variant) integer codes [2]. 
Table
+   * below shows the most common geometry types and their codes:
+   *
+   * | Type               | XY   | XYZ  | XYM  | XYZM |
+   * | :----------------- | :--- | :--- | :--- | :--: |
+   * | Point              | 0001 | 1001 | 2001 | 3001 |
+   * | LineString         | 0002 | 1002 | 2002 | 3002 |
+   * | Polygon            | 0003 | 1003 | 2003 | 3003 |
+   * | MultiPoint         | 0004 | 1004 | 2004 | 3004 |
+   * | MultiLineString    | 0005 | 1005 | 2005 | 3005 |
+   * | MultiPolygon       | 0006 | 1006 | 2006 | 3006 |
+   * | GeometryCollection | 0007 | 1007 | 2007 | 3007 |
+   *
+   * In addition, the following rules are used:
+   * - A list of multiple values indicates that multiple geometry types are
+   *   present (e.g. `[0003, 0006]`).
+   * - An empty array explicitly signals that the geometry types are not known.
+   * - The geometry types in the list must be unique (e.g. `[0001, 0001]`
+   *   is not valid).
+   *
+   * Please refer to links below for more detail:
+   * [1] 
https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary

Review Comment:
   Yeah, I really wish OGC published these as html documents so we could link 
directly into the relevant section, instead of just saying go to 
https://portal.ogc.org/files/?artifact_id=25355 and look at chapter 8.



##########
src/main/thrift/parquet.thrift:
##########
@@ -237,6 +237,97 @@ struct SizeStatistics {
    3: optional list<i64> definition_level_histogram;
 }
 
+/**
+ * Interpretation for edges of GEOMETRY logical type, i.e. whether the edge
+ * between points represent a straight cartesian line or the shortest line on
+ * the sphere. It applies to all non-point geometry objects.
+ */
+enum Edges {
+  PLANAR = 0;
+  SPHERICAL = 1;
+}
+
+/**
+ * A custom binary-encoded polygon or multi-polygon to represent a covering of
+ * geometries. For example, it may be a bounding box or an envelope of 
geometries
+ * when a bounding box cannot be built (e.g. a geometry has spherical edges, 
or if
+ * an edge of geographic coordinates crosses the antimeridian). In addition, 
it can
+ * also be used to provide vendor-agnostic coverings like S2 or H3 grids.
+ */
+struct Covering {
+  /**
+   * A type of covering. Currently accepted values: "WKB".
+   */
+  1: required string kind;
+  /**
+   * A payload specific to kind. Below are the supported values:
+   * - WKB: well-known binary of a POLYGON or MULTI-POLYGON that completely

Review Comment:
   ```suggestion
      * - WKB: well-known binary [1] of a POLYGON or MULTI-POLYGON that 
completely
   ```
   
   Not sure this formatting works, but this is the first mention of WKB. I 
originally just added it here
   as I'm not sure that the name 'well-known binary' is actually that well 
known. Realized that there is the same link as below (currently line 325), but 
not sure if the [1] does auto-linking, or if this comment in the code should 
explain its own [1] link.



##########
src/main/thrift/parquet.thrift:
##########
@@ -373,6 +467,98 @@ struct JsonType {
 struct BsonType {
 }
 
+/**
+ * Physical type and encoding for the geometry type.
+ */
+enum GeometryEncoding {
+  /**
+   * Allowed for physical type: BYTE_ARRAY.
+   *
+   * Well-known binary (WKB) representations of geometries.
+   *
+   * To be clear, we follow the same rule of WKB and coordinate axis order from
+   * GeoParquet [1][2]. It is the ISO WKB supporting XY, XYZ, XYM, XYZM and the
+   * standard geometry types (Point, LineString, Polygon, MultiPoint,
+   * MultiLineString, MultiPolygon, and GeometryCollection).
+   *
+   * This is the preferred encoding for maximum portability. It also supports
+   * GeometryStatistics to be set in the column chunk and page index.
+   *
+   * [1] 
https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L92
+   * [2] 
https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L155
+   */
+  WKB = 0;
+}
+
+/**
+ * Geometry logical type annotation (added in 2.11.0)
+ */
+struct GeometryType {
+  /**
+   * Physical type and encoding for the geometry type.
+   * Please refer to the definition of GeometryEncoding for more detail.
+   */
+  1: required GeometryEncoding encoding;
+  /**
+   * Edges of geometry type.

Review Comment:
   ```suggestion
      * Interpretation for edges of elements of a GEOMETRY logical type, i.e. 
whether the interpolation 
      * between points along an edge represents a straight cartesian line or 
the shortest line on the sphere.
   ```
   In the other comments there seemed to be desire to keep the 'edges' name 
here, but this documentation is so short that it seemed useful to at least give 
a short overview before referring people to a more in depth definition.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to