This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/orc-format.git
The following commit(s) were added to refs/heads/main by this push:
new e807a18 ORC-1717: Add geometry and geography types (#18)
e807a18 is described below
commit e807a18b1029834f2729f5cc2e350e6499d2c2ec
Author: Gang Wu <[email protected]>
AuthorDate: Thu Apr 10 07:47:40 2025 +0800
ORC-1717: Add geometry and geography types (#18)
### What changes were proposed in this pull request?
Add geometry and geography types to Apache ORC.
### Why are the changes needed?
Geospatial support is missing from ORC, even though many popular databases,
query engines, and computing frameworks already provide it.
### How was this patch tested?
N/A
---
specification/ORCv2.md | 175 +++++++++++++++++++++++++++++++
src/main/proto/orc/proto/orc_proto.proto | 36 +++++++
2 files changed, 211 insertions(+)
diff --git a/specification/ORCv2.md b/specification/ORCv2.md
index 73daf6e..9aaeb91 100644
--- a/specification/ORCv2.md
+++ b/specification/ORCv2.md
@@ -261,6 +261,8 @@ message Type {
VARCHAR = 16;
CHAR = 17;
TIMESTAMP_INSTANT = 18;
+ GEOMETRY = 19;
+ GEOGRAPHY = 20;
}
// the kind of this type
required Kind kind = 1;
@@ -273,9 +275,84 @@ message Type {
// the precision and scale for decimal
optional uint32 precision = 5;
optional uint32 scale = 6;
+ repeated StringPair attributes = 7;
+ // Coordinate Reference System (CRS) for Geometry and Geography types
+ optional string crs = 8;
+ // Edge interpolation algorithm for Geography type
+ enum EdgeInterpolationAlgorithm {
+ SPHERICAL = 0;
+ VINCENTY = 1;
+ THOMAS = 2;
+ ANDOYER = 3;
+ KARNEY = 4;
+ }
+ optional EdgeInterpolationAlgorithm algorithm = 9;
}
```
+#### Geometry & Geography Types
+
+##### Background
+
+The Geometry and Geography class hierarchy and its Well-Known Text (WKT) and
+Well-Known Binary (WKB) serializations (ISO variant supporting XY, XYZ, XYM,
+XYZM) are defined by [OpenGIS Implementation Specification for Geographic
+information - Simple feature access - Part 1: Common architecture][sfa-part1],
+from [OGC (Open Geospatial Consortium)][ogc].
+
+The version of the OGC standard first used here is 1.2.1, but future versions
+may also be used if the WKB representation remains wire-compatible.
+
+[sfa-part1]: https://portal.ogc.org/files/?artifact_id=25355
+[ogc]: https://www.ogc.org/standard/sfa/
+
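As a rough illustration of the WKB serialization mentioned above, the sketch below reads only the fixed header of a WKB value: a one-byte byte-order flag (0 for big-endian, 1 for little-endian) followed by a 32-bit geometry type code. The function name `read_wkb_header` is hypothetical and is not part of any ORC library; the sample point is built by hand for the example.

```python
import struct
from typing import Tuple


def read_wkb_header(wkb: bytes) -> Tuple[str, int]:
    """Return (byte order, geometry type code) from the first 5 bytes of a WKB value."""
    if len(wkb) < 5:
        raise ValueError("WKB value is too short to hold a header")
    order_flag = wkb[0]
    if order_flag == 0:
        order, fmt = "big", ">I"
    elif order_flag == 1:
        order, fmt = "little", "<I"
    else:
        raise ValueError(f"unknown byte-order flag: {order_flag}")
    # The type code is an unsigned 32-bit integer right after the flag.
    (type_code,) = struct.unpack_from(fmt, wkb, 1)
    return order, type_code


# A hand-built little-endian XY Point (type code 1) at x=30.0, y=10.0.
point = struct.pack("<BIdd", 1, 1, 30.0, 10.0)
```

Calling `read_wkb_header(point)` yields the byte order and the ISO type code without decoding any coordinates, which is usually enough to decide how to dispatch the rest of the value.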
+###### Coordinate Reference System
+
+A Coordinate Reference System (CRS) defines how coordinates map to
+locations on Earth.
+
+The default CRS `OGC:CRS84` means that the geospatial features must be stored
+in the order of longitude/latitude based on the WGS84 datum.
+
+A custom CRS can be specified by a string value. It is recommended to use an
+identifier-based approach like [Spatial reference identifier][srid].
+
+For a geographic CRS, longitudes are bounded by [-180, 180] and latitudes are
+bounded by [-90, 90].
+
+[srid]: https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier
+
+###### Edge Interpolation Algorithm
+
+Edges are interpolated with one of the following algorithms:
+
+* `spherical`: edges are interpolated as geodesics on a sphere.
+* `vincenty`: [https://en.wikipedia.org/wiki/Vincenty%27s_formulae](https://en.wikipedia.org/wiki/Vincenty%27s_formulae)
+* `thomas`: Thomas, Paul D. Spheroidal geodesics, reference systems, & local geometry. US Naval Oceanographic Office, 1970.
+* `andoyer`: Thomas, Paul D. Mathematical models for navigation systems. US Naval Oceanographic Office, 1965.
+* `karney`: [Karney, Charles FF. "Algorithms for geodesics." Journal of Geodesy 87 (2013): 43-55](https://link.springer.com/content/pdf/10.1007/s00190-012-0578-z.pdf), and [GeographicLib](https://geographiclib.sourceforge.io/)
+
+###### CRS Customization
+
+CRS is represented as a string value. Writer and reader implementations are
+responsible for serializing and deserializing the CRS, respectively.
+
+To maximize interoperability, the convention is to specify custom CRS values
+as a string of the format `type:identifier`, where `type` is one of the
+following values:
+
+* `srid`: [Spatial reference identifier](https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier), `identifier` is the SRID itself.
+* `projjson`: [PROJJSON](https://proj.org/en/stable/specifications/projjson.html), `identifier` is the name of a table property or a file property where the projjson string is stored.
+
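The `type:identifier` convention above could be handled by a reader roughly as follows. This is a minimal sketch, not an ORC API: `ParsedCrs` and `parse_crs` are hypothetical names, treating an unset value as the default `OGC:CRS84` is an assumption drawn from the default described earlier, and exact lowercase matching of the `srid` and `projjson` prefixes is also an assumption.

```python
from typing import NamedTuple, Optional

DEFAULT_CRS = "OGC:CRS84"  # default CRS named in the specification text


class ParsedCrs(NamedTuple):
    kind: str        # "default", "srid", "projjson", or "other"
    identifier: str  # SRID, property name, or the raw string


def parse_crs(crs: Optional[str]) -> ParsedCrs:
    """Classify a CRS string using the `type:identifier` convention."""
    if not crs:
        # Assumption: an unset or empty CRS means the default applies.
        return ParsedCrs("default", DEFAULT_CRS)
    prefix, sep, identifier = crs.partition(":")
    if sep and identifier and prefix in ("srid", "projjson"):
        return ParsedCrs(prefix, identifier)
    # Anything else is passed through untouched for the caller to interpret.
    return ParsedCrs("other", crs)
```

For `projjson`, the returned identifier is only the name of the property holding the PROJJSON string, so a caller would still need to look that property up in the table or file metadata.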
+###### Coordinate Axis Order
+
+The axis order of the coordinates in WKB and bounding box stored here
+follows the de facto standard for axis order in WKB and is therefore always
+(x, y) where x is easting or longitude and y is northing or latitude. This
+ordering explicitly overrides the axis order as specified in the CRS.
+
### Column Statistics
The goal of the column statistics is that for each column, the writer
@@ -303,6 +380,7 @@ message ColumnStatistics {
optional bool hasNull = 10;
optional uint64 bytes_on_disk = 11;
optional CollectionStatistics collection_statistics = 12;
+ optional GeospatialStatistics geospatial_statistics = 13;
}
```
@@ -397,6 +475,88 @@ message BinaryStatistics {
}
```
+Geometry and Geography columns store an optional bounding box and a list of
+geospatial type codes collected from all values.
+
+**Bounding Box**
+
+A geospatial instance has at least two coordinate dimensions: X and Y for 2D
+coordinates of each point. Please note that X is longitude/easting and Y is
+latitude/northing. A geospatial instance can optionally have Z and/or M values
+associated with each point.
+
+The Z values introduce the third dimension coordinate. Usually they are used to
+indicate the height, or elevation.
+
+M values are an opportunity for a geospatial instance to express a fourth
+dimension as a coordinate value. These values can be used as a linear reference
+value (e.g., highway milepost value), a timestamp, or some other value as defined
+by the CRS.
+
+The bounding box is defined by the protobuf message below as a min/max value
+pair of coordinates for each axis. Note that X and Y values are always present.
+Z and M are omitted for 2D geospatial instances.
+
+For the X values only, xmin may be greater than xmax. In this case, an object
+in this bounding box may match if it contains an X such that `x >= xmin` OR
+`x <= xmax`. This wraparound occurs only when the corresponding bounding box
+crosses the antimeridian line. In geographic terminology, the concepts of `xmin`,
+`xmax`, `ymin`, and `ymax` are also known as `westernmost`, `easternmost`,
+`southernmost` and `northernmost`, respectively.
+
+For Geography type, X and Y values are restricted to the canonical ranges of
+[-180, 180] for X and [-90, 90] for Y.
+
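The wraparound rule for X can be sketched as a small predicate. The helper names `x_in_range` and `bbox_may_contain` are hypothetical and only illustrate the comparison described above; a real reader would also need to handle missing statistics.

```python
def x_in_range(x: float, xmin: float, xmax: float) -> bool:
    """Check an X value against [xmin, xmax], allowing antimeridian wraparound."""
    if xmin <= xmax:
        # Ordinary case: a single contiguous interval.
        return xmin <= x <= xmax
    # Wraparound case: the box crosses the antimeridian, so the valid
    # X values are those at or east of xmin OR at or west of xmax.
    return x >= xmin or x <= xmax


def bbox_may_contain(x: float, y: float,
                     xmin: float, xmax: float,
                     ymin: float, ymax: float) -> bool:
    """Return True if the point (x, y) could fall inside the bounding box."""
    # Only X wraps around; Y is always a plain interval test.
    return x_in_range(x, xmin, xmax) and ymin <= y <= ymax
```

With `xmin = 170` and `xmax = -170`, longitudes such as 179 and -175 match while 0 does not, which is the behaviour expected for a box straddling the antimeridian.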
+**Geospatial Types**
+
+A list of geospatial types from all instances in the Geometry or Geography
+column, or an empty list if they are not known.
+
+This is borrowed from [geometry_types of GeoParquet][geometry-types] except that
+values in the list are [WKB (ISO-variant) integer codes][wkb-integer-code].
+The table below shows the most common geospatial types and their codes:
+
+| Type | XY | XYZ | XYM | XYZM |
+| :----------------- | :--- | :--- | :--- | :--- |
+| Point | 0001 | 1001 | 2001 | 3001 |
+| LineString | 0002 | 1002 | 2002 | 3002 |
+| Polygon | 0003 | 1003 | 2003 | 3003 |
+| MultiPoint | 0004 | 1004 | 2004 | 3004 |
+| MultiLineString | 0005 | 1005 | 2005 | 3005 |
+| MultiPolygon | 0006 | 1006 | 2006 | 3006 |
+| GeometryCollection | 0007 | 1007 | 2007 | 3007 |
+
+In addition, the following rules are applied:
+- A list of multiple values indicates that multiple geospatial types are present (e.g. `[0003, 0006]`).
+- An empty array explicitly signals that the geospatial types are not known.
+- The geospatial types in the list must be unique (e.g. `[0001, 0001]` is not valid).
+
+[geometry-types]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L159
+[wkb-integer-code]: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary
+
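The code table above follows a regular pattern: the thousands digit selects the dimensions and the remainder selects the base type. The sketch below decodes a code on that basis and checks the uniqueness rule. The names `describe_type_code` and `has_unique_codes` are hypothetical, and because the codes are stored as `int32`, the example assumes that a table entry such as `0001` corresponds to the integer 1.

```python
from typing import List, Tuple

_BASE_TYPES = {
    1: "Point",
    2: "LineString",
    3: "Polygon",
    4: "MultiPoint",
    5: "MultiLineString",
    6: "MultiPolygon",
    7: "GeometryCollection",
}
_DIMENSIONS = {0: "XY", 1: "XYZ", 2: "XYM", 3: "XYZM"}


def describe_type_code(code: int) -> Tuple[str, str]:
    """Split a WKB (ISO variant) integer code into (type name, dimensions)."""
    dim, base = divmod(code, 1000)
    if dim not in _DIMENSIONS or base not in _BASE_TYPES:
        # Only the common types from the table are covered by this sketch.
        raise ValueError(f"unsupported geospatial type code: {code}")
    return _BASE_TYPES[base], _DIMENSIONS[dim]


def has_unique_codes(codes: List[int]) -> bool:
    """Check the rule that the type list must not contain duplicates."""
    return len(set(codes)) == len(codes)
```

A reader could use `describe_type_code` to skip a stripe or row group when a query only asks for, say, polygons and the statistics list contains only point codes.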
+```protobuf
+// Bounding box for Geometry or Geography type in the representation of min/max
+// value pair of coordinates from each axis.
+message BoundingBox {
+ optional double xmin = 1;
+ optional double xmax = 2;
+ optional double ymin = 3;
+ optional double ymax = 4;
+ optional double zmin = 5;
+ optional double zmax = 6;
+ optional double mmin = 7;
+ optional double mmax = 8;
+}
+
+// Statistics specific to Geometry or Geography type
+message GeospatialStatistics {
+ // A bounding box of geospatial instances
+ optional BoundingBox bbox = 1;
+ // Geospatial type codes of all instances, or an empty list if not known
+ repeated int32 geospatial_types = 2;
+}
+```
+
### User Metadata
The user can add arbitrary key/value pairs to an ORC file as it is
@@ -1235,6 +1395,21 @@ Encoding | Stream Kind | Optional | Contents
DIRECT | PRESENT | Yes | Boolean RLE
| DIRECT | No | Byte RLE
+## Geometry & Geography Columns
+
+Geometry and Geography data are encoded with a PRESENT stream, a DATA stream that records
+the WKB-encoded geometry/geography data as binary, and a LENGTH stream that records
+the number of bytes per value.
+
+Encoding | Stream Kind | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT | PRESENT | Yes | Boolean RLE
+ | DATA | No | Binary contents
+ | LENGTH | No | Unsigned Integer RLE v1
+DIRECT_V2 | PRESENT | Yes | Boolean RLE
+ | DATA | No | Binary contents
+ | LENGTH | No | Unsigned Integer RLE v2
+
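The relationship between the three streams can be sketched as below. The example assumes the streams have already been decoded (Boolean RLE and integer RLE removed) and that, as with other ORC column types, null values contribute no entry to the LENGTH stream and no bytes to the DATA stream. The function name `split_values` is hypothetical.

```python
from typing import List, Optional


def split_values(present: List[bool],
                 lengths: List[int],
                 data: bytes) -> List[Optional[bytes]]:
    """Rebuild per-row WKB values from decoded PRESENT, LENGTH, and DATA streams."""
    values: List[Optional[bytes]] = []
    offset = 0
    length_iter = iter(lengths)
    for is_present in present:
        if not is_present:
            # Null row: nothing is consumed from LENGTH or DATA.
            values.append(None)
            continue
        size = next(length_iter)
        values.append(data[offset:offset + size])
        offset += size
    return values
```

Each returned `bytes` object is one WKB-encoded geometry or geography value, ready to be handed to a WKB parser.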
# Indexes
## Row Group Index
diff --git a/src/main/proto/orc/proto/orc_proto.proto b/src/main/proto/orc/proto/orc_proto.proto
index 16c5523..1c38fc7 100644
--- a/src/main/proto/orc/proto/orc_proto.proto
+++ b/src/main/proto/orc/proto/orc_proto.proto
@@ -84,6 +84,27 @@ message CollectionStatistics {
optional uint64 total_children = 3;
}
+// Bounding box for Geometry or Geography type in the representation of min/max
+// value pair of coordinates from each axis.
+message BoundingBox {
+ optional double xmin = 1;
+ optional double xmax = 2;
+ optional double ymin = 3;
+ optional double ymax = 4;
+ optional double zmin = 5;
+ optional double zmax = 6;
+ optional double mmin = 7;
+ optional double mmax = 8;
+}
+
+// Statistics specific to Geometry or Geography type
+message GeospatialStatistics {
+ // A bounding box of geospatial instances
+ optional BoundingBox bbox = 1;
+ // Geospatial type codes of all instances, or an empty list if not known
+ repeated int32 geospatial_types = 2;
+}
+
message ColumnStatistics {
optional uint64 number_of_values = 1;
optional IntegerStatistics int_statistics = 2;
@@ -97,6 +118,7 @@ message ColumnStatistics {
optional bool has_null = 10;
optional uint64 bytes_on_disk = 11;
optional CollectionStatistics collection_statistics = 12;
+ optional GeospatialStatistics geospatial_statistics = 13;
}
message RowIndexEntry {
@@ -216,6 +238,8 @@ message Type {
VARCHAR = 16;
CHAR = 17;
TIMESTAMP_INSTANT = 18;
+ GEOMETRY = 19;
+ GEOGRAPHY = 20;
}
optional Kind kind = 1;
repeated uint32 subtypes = 2 [packed=true];
@@ -224,6 +248,18 @@ message Type {
optional uint32 precision = 5;
optional uint32 scale = 6;
repeated StringPair attributes = 7;
+
+ // Coordinate Reference System (CRS) for Geometry and Geography types
+ optional string crs = 8;
+ // Edge interpolation algorithm for Geography type
+ enum EdgeInterpolationAlgorithm {
+ SPHERICAL = 0;
+ VINCENTY = 1;
+ THOMAS = 2;
+ ANDOYER = 3;
+ KARNEY = 4;
+ }
+ optional EdgeInterpolationAlgorithm algorithm = 9;
}
message StripeInformation {