This is an automated email from the ASF dual-hosted git repository.
gangwu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
new 94b9d63 PARQUET-2471: Add GEOMETRY and GEOGRAPHY logical types (#240)
94b9d63 is described below
commit 94b9d631aef332c78b8f1482fb032743a9c3c407
Author: Gang Wu <[email protected]>
AuthorDate: Mon Feb 10 10:25:40 2025 +0800
PARQUET-2471: Add GEOMETRY and GEOGRAPHY logical types (#240)
Co-authored-by: Dewey Dunnington <[email protected]>
Co-authored-by: Jia Yu <[email protected]>
---
Geospatial.md | 164 +++++++++++++++++++++++++++++++++++++++++
LogicalTypes.md | 33 +++++++++
src/main/thrift/parquet.thrift | 79 ++++++++++++++++++++
3 files changed, 276 insertions(+)
diff --git a/Geospatial.md b/Geospatial.md
new file mode 100644
index 0000000..4be4a38
--- /dev/null
+++ b/Geospatial.md
@@ -0,0 +1,164 @@
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one
+ - or more contributor license agreements. See the NOTICE file
+ - distributed with this work for additional information
+ - regarding copyright ownership. The ASF licenses this file
+ - to you under the Apache License, Version 2.0 (the
+ - "License"); you may not use this file except in compliance
+ - with the License. You may obtain a copy of the License at
+ -
+ - http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing,
+ - software distributed under the License is distributed on an
+ - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ - KIND, either express or implied. See the License for the
+ - specific language governing permissions and limitations
+ - under the License.
+ -->
+
+Geospatial Definitions
+====
+
+This document contains the specification of geospatial types and statistics.
+
+# Background
+
+The Geometry and Geography class hierarchy and its Well-Known Text (WKT) and
+Well-Known Binary (WKB) serializations (ISO variant supporting XY, XYZ, XYM,
+XYZM) are defined by [OpenGIS Implementation Specification for Geographic
+information - Simple feature access - Part 1: Common architecture][sfa-part1],
+from the [OGC (Open Geospatial Consortium)][ogc].
+
+The version of the OGC standard first used here is 1.2.1, but future versions
+may also be used if the WKB representation remains wire-compatible.
+
+[sfa-part1]: https://portal.ogc.org/files/?artifact_id=25355
+[ogc]: https://www.ogc.org/standard/sfa/
+
+## Coordinate Reference System
+
+A Coordinate Reference System (CRS) defines how coordinates refer to
+locations on Earth.
+
+The default CRS `OGC:CRS84` means that the geospatial features must be stored
+in the order of longitude/latitude based on the WGS84 datum.
+
+A custom CRS can be specified by a string value. It is recommended to use an
+identifier-based approach such as a [spatial reference identifier][srid].
+
+For geographic CRS, longitudes are bound by [-180, 180] and latitudes are bound
+by [-90, 90].
+
+[srid]: https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier
+
+## Edge Interpolation Algorithm
+
+The edge interpolation algorithm specifies how the edges between vertices are
+interpolated. It is one of the following values:
+
+* `spherical`: edges are interpolated as geodesics on a sphere.
+* `vincenty`: [https://en.wikipedia.org/wiki/Vincenty%27s_formulae](https://en.wikipedia.org/wiki/Vincenty%27s_formulae)
+* `thomas`: Thomas, Paul D. Spheroidal geodesics, reference systems, & local geometry. US Naval Oceanographic Office, 1970.
+* `andoyer`: Thomas, Paul D. Mathematical models for navigation systems. US Naval Oceanographic Office, 1965.
+* `karney`: [Karney, Charles FF. "Algorithms for geodesics." Journal of Geodesy 87 (2013): 43-55](https://link.springer.com/content/pdf/10.1007/s00190-012-0578-z.pdf), and [GeographicLib](https://geographiclib.sourceforge.io/)
+
+# Logical Types
+
+Two geospatial logical type annotations are supported:
+* `GEOMETRY`: geospatial features in the WKB format with linear/planar edge interpolation. See [Geometry](LogicalTypes.md#geometry)
+* `GEOGRAPHY`: geospatial features in the WKB format with an explicit (non-linear/non-planar) edge interpolation algorithm. See [Geography](LogicalTypes.md#geography)
+
+# Statistics
+
+`GeospatialStatistics` is a struct specific to the `GEOMETRY` and `GEOGRAPHY`
+logical types that stores statistics of a column chunk. It is an optional field
+in `ColumnMetaData` and contains a [Bounding Box](#bounding-box) and [Geospatial
+Types](#geospatial-types), both described in detail below.
+
+## Bounding Box
+
+A geospatial instance has at least two coordinate dimensions: X and Y for 2D
+coordinates of each point. Please note that X is longitude/easting and Y is
+latitude/northing. A geospatial instance can optionally have Z and/or M values
+associated with each point.
+
+The Z values introduce the third dimension coordinate. Usually they are used to
+indicate the height, or elevation.
+
+M values are an opportunity for a geospatial instance to express a fourth
+dimension as a coordinate value. These values can be used as a linear reference
+value (e.g., highway milepost value), a timestamp, or some other value as defined
+by the CRS.
+
+A bounding box is defined by the Thrift struct below as a pair of min/max values
+for each coordinate axis. Note that X and Y values are always present. Z and M
+are omitted for 2D geospatial instances.
+
+For the X values only, xmin may be greater than xmax. In this case, an object
+in this bounding box may match if it contains an X such that `x >= xmin` OR
+`x <= xmax`. This wraparound occurs only when the corresponding bounding box
+crosses the antimeridian line. In geographic terminology, the concepts of `xmin`,
+`xmax`, `ymin`, and `ymax` are also known as `westernmost`, `easternmost`,
+`southernmost` and `northernmost`, respectively.
+
+For `GEOGRAPHY` types, X and Y values are restricted to the canonical ranges of
+[-180, 180] for X and [-90, 90] for Y.
+
+```thrift
+struct BoundingBox {
+ 1: required double xmin;
+ 2: required double xmax;
+ 3: required double ymin;
+ 4: required double ymax;
+ 5: optional double zmin;
+ 6: optional double zmax;
+ 7: optional double mmin;
+ 8: optional double mmax;
+}
+```
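
The X wraparound rule described above can be sketched in Python as follows. This is a minimal illustration and not part of the commit or the specification: the `BBox` class and the helpers `x_in_range` and `may_contain_point` are hypothetical names, and the optional Z and M ranges are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Hypothetical in-memory mirror of the required BoundingBox fields."""
    xmin: float
    xmax: float
    ymin: float
    ymax: float


def x_in_range(bbox: BBox, x: float) -> bool:
    """Return True if x lies inside the box's X interval."""
    if bbox.xmin <= bbox.xmax:
        # Ordinary case: a single closed interval [xmin, xmax].
        return bbox.xmin <= x <= bbox.xmax
    # xmin > xmax: the box crosses the antimeridian, so the X range
    # wraps around and a match needs x >= xmin OR x <= xmax.
    return x >= bbox.xmin or x <= bbox.xmax


def may_contain_point(bbox: BBox, x: float, y: float) -> bool:
    """Y never wraps, so it is a plain closed-interval test."""
    return x_in_range(bbox, x) and bbox.ymin <= y <= bbox.ymax
```

For example, a box with `xmin=170` and `xmax=-170` matches points at longitude 175 and -175 but not at longitude 0.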
+
+## Geospatial Types
+
+A list of geospatial types from all instances in the `GEOMETRY` or `GEOGRAPHY`
+column, or an empty list if they are not known.
+
+This is borrowed from [geometry_types of GeoParquet][geometry-types] except that
+values in the list are [WKB (ISO-variant) integer codes][wkb-integer-code].
+The table below shows the most common geospatial types and their codes:
+
+| Type | XY | XYZ | XYM | XYZM |
+| :----------------- | :--- | :--- | :--- | :--: |
+| Point | 0001 | 1001 | 2001 | 3001 |
+| LineString | 0002 | 1002 | 2002 | 3002 |
+| Polygon | 0003 | 1003 | 2003 | 3003 |
+| MultiPoint | 0004 | 1004 | 2004 | 3004 |
+| MultiLineString | 0005 | 1005 | 2005 | 3005 |
+| MultiPolygon | 0006 | 1006 | 2006 | 3006 |
+| GeometryCollection | 0007 | 1007 | 2007 | 3007 |
+
+In addition, the following rules are applied:
+- A list of multiple values indicates that multiple geospatial types are present (e.g. `[0003, 0006]`).
+- An empty array explicitly signals that the geospatial types are not known.
+- The geospatial types in the list must be unique (e.g. `[0001, 0001]` is not valid).
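
As a rough illustration of the table and rules above, the Python sketch below splits a code into its geometry type and dimension and checks the uniqueness rule. It is not part of the commit: the names `decode_type_code` and `validate_geospatial_types` are hypothetical, and the sketch assumes only the seven types in the table need to be recognized. Because the codes are stored as `i32` values, `0003` in the table is simply the integer `3`.

```python
from typing import List, Tuple

# Base geometry types from the table (the last three digits of the code).
BASE_TYPES = {
    1: "Point",
    2: "LineString",
    3: "Polygon",
    4: "MultiPoint",
    5: "MultiLineString",
    6: "MultiPolygon",
    7: "GeometryCollection",
}

# Dimension is encoded in the thousands digit of the code.
DIMENSIONS = {0: "XY", 1: "XYZ", 2: "XYM", 3: "XYZM"}


def decode_type_code(code: int) -> Tuple[str, str]:
    """Split a code such as 3006 into ("MultiPolygon", "XYZM")."""
    dim, base = divmod(code, 1000)
    if dim not in DIMENSIONS or base not in BASE_TYPES:
        # Assumption of this sketch: codes outside the table are rejected.
        raise ValueError(f"unrecognized geospatial type code: {code}")
    return BASE_TYPES[base], DIMENSIONS[dim]


def validate_geospatial_types(codes: List[int]) -> None:
    """Raise ValueError on duplicates or codes outside the table."""
    if len(set(codes)) != len(codes):
        raise ValueError("geospatial type codes must be unique")
    for code in codes:
        decode_type_code(code)
```

An empty list passes validation, which matches the rule that an empty array means the types are not known.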
+
+[geometry-types]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L159
+[wkb-integer-code]: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary
+
+# CRS Customization
+
+CRS is represented as a string value. Writer and reader implementations are
+responsible for serializing and deserializing the CRS, respectively.
+
+As a convention to maximize interoperability, custom CRS values can be
+specified by a string of the format `type:identifier`, where `type` is one of
+the following values:
+
+* `srid`: [Spatial reference identifier](https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier), `identifier` is the SRID itself.
+* `projjson`: [PROJJSON](https://proj.org/en/stable/specifications/projjson.html), `identifier` is the name of a table property or a file property where the projjson string is stored.
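
A reader might handle the `type:identifier` convention roughly as in the Python sketch below. This is illustrative only and not part of the commit: the function name `parse_crs` and the `"default"` and `"opaque"` labels are hypothetical, and the sketch assumes the `type` prefix is matched case-sensitively, which the text above does not state.

```python
from typing import Optional, Tuple

DEFAULT_CRS = "OGC:CRS84"
KNOWN_TYPES = ("srid", "projjson")


def parse_crs(crs: Optional[str]) -> Tuple[str, str]:
    """Classify a CRS string as (kind, value).

    kind is "default" when the field is unset, "srid" or "projjson" when
    the string follows the type:identifier convention, and "opaque" for
    any other string, which is returned unchanged.
    """
    if crs is None:
        return ("default", DEFAULT_CRS)
    kind, sep, identifier = crs.partition(":")
    if sep and kind in KNOWN_TYPES and identifier:
        return (kind, identifier)
    # Anything else is left for the caller to interpret.
    return ("opaque", crs)
```

With this sketch, `parse_crs("srid:4326")` yields `("srid", "4326")`, while for `projjson` the returned identifier is only the name of the property holding the PROJJSON string, so the caller still has to look that property up.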
+
+# Coordinate axis order
+
+The axis order of the coordinates in WKB and bounding box stored in Parquet
+follows the de facto standard for axis order in WKB and is therefore always
+(x, y) where x is easting or longitude and y is northing or latitude. This
+ordering explicitly overrides the axis order as specified in the CRS.
diff --git a/LogicalTypes.md b/LogicalTypes.md
index 7294015..e7a0ce0 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -599,6 +599,39 @@ optional group variant_shredded (VARIANT) {
}
```
+### GEOMETRY
+
+`GEOMETRY` is used for geospatial features in the Well-Known Binary (WKB) format
+with linear/planar edge interpolation. It must annotate a `BYTE_ARRAY`
+primitive type. See [Geospatial.md](Geospatial.md) for more detail.
+
+The type has only one type parameter:
+- `crs`: An optional string value for CRS. If unset, the CRS defaults to
+ `"OGC:CRS84"`, which means that the geometries must be stored in longitude,
+ latitude based on the WGS84 datum.
+
+The sort order used for `GEOMETRY` is undefined. When writing data, no min/max
+statistics should be saved for this type and if such non-compliant statistics
+are found during reading, they must be ignored.
+
+### GEOGRAPHY
+
+`GEOGRAPHY` is used for geospatial features in the WKB format with an explicit
+(non-linear/non-planar) edge interpolation algorithm. It must annotate a
+`BYTE_ARRAY` primitive type. See [Geospatial.md](Geospatial.md) for more detail.
+
+The type has two type parameters:
+- `crs`: An optional string value for CRS. It must be a geographic CRS, where
+ longitudes are bound by [-180, 180] and latitudes are bound by [-90, 90].
+ If unset, the CRS defaults to `"OGC:CRS84"`.
+- `algorithm`: An optional enum value that describes the edge interpolation
+ algorithm. Supported values are: `SPHERICAL`, `VINCENTY`, `THOMAS`, `ANDOYER`,
+ `KARNEY`. If unset, the algorithm defaults to `SPHERICAL`.
+
+The sort order used for `GEOGRAPHY` is undefined. When writing data, no min/max
+statistics should be saved for this type and if such non-compliant statistics
+are found during reading, they must be ignored.
+
## Nested Types
This section specifies how `LIST` and `MAP` can be used to encode nested types
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 5d4431d..ee701aa 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -237,6 +237,29 @@ struct SizeStatistics {
3: optional list<i64> definition_level_histogram;
}
+/**
+ * Bounding box for GEOMETRY or GEOGRAPHY type in the representation of min/max
+ * value pair of coordinates from each axis.
+ */
+struct BoundingBox {
+ 1: required double xmin;
+ 2: required double xmax;
+ 3: required double ymin;
+ 4: required double ymax;
+ 5: optional double zmin;
+ 6: optional double zmax;
+ 7: optional double mmin;
+ 8: optional double mmax;
+}
+
+/** Statistics specific to Geometry and Geography logical types */
+struct GeospatialStatistics {
+ /** A bounding box of geospatial instances */
+ 1: optional BoundingBox bbox;
+ /** Geospatial type codes of all instances, or an empty list if not known */
+ 2: optional list<i32> geospatial_types;
+}
+
/**
* Statistics per row group and per page
* All fields are optional.
@@ -386,6 +409,55 @@ struct BsonType {
struct VariantType {
}
+/** Edge interpolation algorithm for Geography logical type */
+enum EdgeInterpolationAlgorithm {
+ SPHERICAL = 0;
+ VINCENTY = 1;
+ THOMAS = 2;
+ ANDOYER = 3;
+ KARNEY = 4;
+}
+
+/**
+ * Embedded Geometry logical type annotation
+ *
+ * Geospatial features in the Well-Known Binary (WKB) format. Edge interpolation
+ * is always linear/planar.
+ *
+ * A custom CRS can be set by the crs field. If unset, it defaults to "OGC:CRS84",
+ * which means that the geometries must be stored in longitude, latitude based on
+ * the WGS84 datum.
+ *
+ * Allowed for physical type: BYTE_ARRAY.
+ *
+ * See Geospatial.md for details.
+ */
+struct GeometryType {
+ 1: optional string crs;
+}
+
+/**
+ * Embedded Geography logical type annotation
+ *
+ * Geospatial features in the WKB format with an explicit (non-linear/non-planar)
+ * edge interpolation algorithm.
+ *
+ * A custom geographic CRS can be set by the crs field, where longitudes are
+ * bound by [-180, 180] and latitudes are bound by [-90, 90]. If unset, the CRS
+ * defaults to "OGC:CRS84".
+ *
+ * An optional algorithm can be set to correctly interpret the edge interpolation
+ * of the geometries. If unset, the algorithm defaults to SPHERICAL.
+ *
+ * Allowed for physical type: BYTE_ARRAY.
+ *
+ * See Geospatial.md for details.
+ */
+struct GeographyType {
+ 1: optional string crs;
+ 2: optional EdgeInterpolationAlgorithm algorithm;
+}
+
/**
* LogicalType annotations to replace ConvertedType.
*
@@ -417,6 +489,8 @@ union LogicalType {
14: UUIDType UUID // no compatible ConvertedType
15: Float16Type FLOAT16 // no compatible ConvertedType
16: VariantType VARIANT // no compatible ConvertedType
+ 17: GeometryType GEOMETRY // no compatible ConvertedType
+ 18: GeographyType GEOGRAPHY // no compatible ConvertedType
}
/**
@@ -857,6 +931,9 @@ struct ColumnMetaData {
* filter pushdown.
*/
16: optional SizeStatistics size_statistics;
+
+ /** Optional statistics specific for Geometry and Geography logical types */
+ 17: optional GeospatialStatistics geospatial_statistics;
}
struct EncryptionWithFooterKey {
@@ -988,6 +1065,8 @@ union ColumnOrder {
* LIST - undefined
* MAP - undefined
* VARIANT - undefined
+ * GEOMETRY - undefined
+ * GEOGRAPHY - undefined
*
* In the absence of logical types, the sort order is determined by the physical type:
* BOOLEAN - false, true