Menelaos Karavelas created SPARK-51658:
------------------------------------------
Summary: SPIP: Add geospatial types in Spark
Key: SPARK-51658
URL: https://issues.apache.org/jira/browse/SPARK-51658
Project: Spark
Issue Type: New Feature
Components: SQL
Affects Versions: 4.1.0
Reporter: Menelaos Karavelas
*Q1. What are you trying to do? Articulate your objectives using absolutely no
jargon.*
Add two new data types to Spark SQL and PySpark, GEOMETRY and GEOGRAPHY, for
handling location and shape data, specifically:
# GEOMETRY: For working with shapes on a Cartesian space (a flat surface, like
a paper map)
# GEOGRAPHY: For working with shapes on an ellipsoidal surface (like the
Earth's surface)
These will let Spark users work with standard shape types like:
* POINTS, MULTIPOINTS
* LINESTRINGS, MULTILINESTRINGS
* POLYGONS, MULTIPOLYGONS
* GEOMETRY COLLECTIONS
These are [standard shape
types|https://portal.ogc.org/files/?artifact_id=25355] as defined by the [Open
Geospatial Consortium (OGC)|https://www.ogc.org/], the international standards
body for Geographic Information Systems (GIS).
The new SQL types will be parametrized with an integer value (a Spatial
Reference IDentifier, or SRID) that defines the underlying coordinate reference
system in which geospatial values live (like GPS coordinates vs. local
measurements).
Popular data storage formats
([Parquet|https://github.com/apache/parquet-format/blob/master/Geospatial.md]
and [Iceberg|https://github.com/apache/iceberg/pull/10981]) are adding support
for GEOMETRY and GEOGRAPHY, and the proposal is to extend Spark so that it can
read, write, and operate on the new geospatial types.
*Q2. What problem is this proposal NOT designed to solve?*
This proposal is not designed to build a comprehensive system for advanced
geospatial analysis in Spark. While we're adding basic support for storing and
reading location/shape data, we are not:
* Creating complex geospatial processing functions
* Building spatial analysis tools
* Developing geometric or geographic calculations and transformations
This proposal lays the foundation, building the infrastructure to handle
geospatial data, but does not create a full-featured geospatial processing
system. Such extensions can be added later as separate improvements.
*Q3. How is it done today, and what are the limits of current practice?*
Current State:
# Users store geospatial data by converting it into basic text (string) or
binary formats because Spark doesn't understand geospatial data directly.
# To work with this data, users need to use external tools to make sense of
and process the data.
Limitations (see the sketch after this list):
* Users have to rely on external libraries
* Users have to convert data back and forth
* Geospatial values cannot be written or read in a native way
* External tools cannot natively understand the data as geospatial
* No file or data skipping
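As a minimal sketch of the status quo (the table and column names here are hypothetical), geospatial data today has to be modeled with plain Spark SQL types, so Spark is unaware of its meaning:
```
-- Today's workaround (sketch): geometry stored as raw WKB bytes, with the
-- SRID tracked manually in a separate column. Spark sees only BINARY and INT,
-- so it cannot validate the data, skip files, or treat the bytes as geospatial.
CREATE TABLE legacy_geo_tbl (geom_wkb BINARY, srid INT);
```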
*Q4. What is new in your approach and why do you think it will be successful?*
The high-level approach is not new, and we have a clear picture of how to split
the work into sub-tasks based on our experience adding new types such as ANSI
intervals and TIMESTAMP_NTZ.
Most existing data processing systems already support working with geospatial
data.
*Q5. Who cares? If you are successful, what difference will it make?*
The new types create the foundation for working with geospatial data in Spark.
Other data analytics systems, such as PostgreSQL, Redshift, Snowflake, and
BigQuery, all have geospatial support.
Spark stays relevant and compatible with the latest Parquet and Iceberg
community developments.
*Q6. What are the risks?*
The addition of the new data types will allow for strictly new workloads, so
the risks in this direction are minimal. The only overlap with existing
functionality is at the type system level (e.g., casts between the new
geospatial types and the existing Spark SQL types). The risk is low, however,
and can be handled through testing.
*Q7. How long will it take?*
In total it might take around 9 months. The estimate is based on similar
tasks: ANSI intervals (SPARK-27790) and TIMESTAMP_NTZ (SPARK-35662). We can
split the work into functional blocks:
# Base functionality - 2 weeks
Add the new geospatial types, literals, type constructors, and external types.
# Persistence - 2.5 months
Ability to create Parquet tables with GEOMETRY and GEOGRAPHY columns, read/write
from/to Parquet and other built-in data sources, stats, predicate push down.
# Basic data skipping operators - 1 month
Implement rudimentary operators for data skipping (e.g., ST_BoxesIntersect,
which checks whether the bounding boxes of two geospatial values intersect).
# Client support - 1 month
JDBC, Hive, Thrift server, Spark Connect
# PySpark integration - 1 month
DataFrame support, pandas API, Python UDFs, Arrow column vectors
# Docs + testing/benchmarking - 1 month
*Q8. What are the mid-term and final “exams” to check for success?*
Mid-term criteria: new type definitions and the ability to read/write
geospatial data from/to Parquet.
Final criteria: functionality equivalent to that of any other scalar data type.
*Appendix A. Proposed API Changes.*
_Geospatial data types_
We propose to support two new parametrized data types: GEOMETRY and GEOGRAPHY.
Both take as parameter either a non-negative integer value called an SRID
(Spatial Reference IDentifier) or the special specifier ANY. See Appendix B for
more details on the type system.
Examples:
* GEOMETRY(0) - materializable; every row has SRID 0
* GEOMETRY(ANY) - not materializable; every row can have a different SRID
* GEOGRAPHY(4326) - materializable; every row has SRID 4326
* GEOGRAPHY(ANY) - not materializable; every row can have a different SRID
GEOMETRY(ANY) and GEOGRAPHY(ANY) act as least common types among the
parametrized GEOMETRY and GEOGRAPHY types.
h4. _Table creation_
Users will be able to create GEOMETRY(<srid>) or GEOGRAPHY(<srid>) columns:
```
CREATE TABLE tbl (geom GEOMETRY(0), geog GEOGRAPHY(4326));
```
Users will not be able to create GEOMETRY(ANY) or GEOGRAPHY(ANY) columns. See
details in Appendix B.
Users will be able to insert geospatial values into such columns:
```
INSERT INTO tbl
VALUES (X'0101000000000000000000f03f0000000000000040'::GEOMETRY(0),
        X'0101000000000000000000f03f0000000000000040'::GEOGRAPHY(4326));
```
h4. _Casts_
The following explicit casts are allowed (all other casts are initially
disallowed):
* GEOMETRY/GEOGRAPHY column to BINARY (output is WKB format)
* BINARY to GEOMETRY/GEOGRAPHY column (input is expected to be in WKB format)
* GEOMETRY(ANY) to GEOMETRY(<srid>)
* GEOGRAPHY(ANY) to GEOGRAPHY(<srid>)
The following implicit casts are allowed (also supported as explicit); see the
sketch after this list:
* GEOMETRY(<srid>) to GEOMETRY(ANY)
* GEOGRAPHY(<srid>) to GEOGRAPHY(ANY)
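A minimal sketch of how these casts could look in SQL, reusing the `::` cast syntax and the hypothetical table `tbl` from the earlier examples:
```
-- Explicit cast: GEOMETRY column to BINARY; the output is in WKB format.
SELECT geom::BINARY FROM tbl;

-- Explicit cast: a BINARY WKB literal to GEOGRAPHY with a fixed SRID.
SELECT X'0101000000000000000000f03f0000000000000040'::GEOGRAPHY(4326);

-- Implicit cast, written explicitly here: GEOMETRY(0) to GEOMETRY(ANY).
SELECT geom::GEOMETRY(ANY) FROM tbl;
```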
h4. _Geospatial expressions_
The following geospatial expressions are to be exposed to users (a usage sketch
follows the list):
* Scalar expressions
** Boolean ST_BoxesIntersect(BoxStruct1, BoxStruct2)
** BoxStruct ST_Box2D(GeoVal)
** BinaryVal ST_AsBinary(GeoVal)
** GeoVal ST_GeomFromWKB(BinaryVal, IntVal)
** GeoVal ST_GeogFromWKB(BinaryVal, IntVal)
* Aggregate expressions
** BoxStruct ST_Extent(GeoCol)
BoxStruct above refers to a struct of 4 double values.
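A hedged usage sketch (the tables and columns are hypothetical; the signatures are the ones listed above):
```
-- Scalar: check whether the bounding boxes of two geometries intersect.
SELECT ST_BoxesIntersect(ST_Box2D(a.geom), ST_Box2D(b.geom))
FROM tbl_a a JOIN tbl_b b ON a.id = b.id;

-- Scalar: construct a GEOMETRY value from WKB bytes and an SRID.
SELECT ST_GeomFromWKB(X'0101000000000000000000f03f0000000000000040', 0);

-- Aggregate: the overall bounding box (a struct of 4 doubles) of a column.
SELECT ST_Extent(geom) FROM tbl;
```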
h4. _Ecosystem_
Support for the SQL, DataFrame, and PySpark APIs, including the Spark Connect
APIs.
*Appendix B.*
h4. _Type System_
We propose to introduce two parametrized types: GEOMETRY and GEOGRAPHY. The
parameter would either be a non-negative integer value or the special specifier
ANY.
* The integer value represents a Spatial Reference IDentifier (SRID) which
uniquely defines the coordinate reference system. These integers will be mapped
to coordinate reference system (CRS) definitions defined by authorities like
OGC, EPSG, or ESRI.
* The ANY specifier refers to GEOMETRY or GEOGRAPHY columns where the SRID
value can be different across the column rows. GEOMETRY(ANY) and GEOGRAPHY(ANY)
will not be materialized. This constraint is imposed by the storage
specifications (the Parquet and Iceberg specs require a unique SRID value per
GEOMETRY or GEOGRAPHY column).
* GEOMETRY(ANY) and GEOGRAPHY(ANY) act as the least common type across all
parametrized GEOMETRY and GEOGRAPHY types. They are also needed for the
proposed ST_GeomFromWKB and ST_GeogFromWKB expressions when the second argument
is not foldable; for non-foldable arguments the return type of these
expressions is GEOMETRY(ANY) and GEOGRAPHY(ANY), respectively (see the sketch
after this list).
* All known SRID values are positive integers. For the GEOMETRY data type we
will also allow the SRID value of 0. This is the typical way of specifying
geometric values with unknown or unspecified underlying CRS. Mathematically
these values are understood as embedded in a Cartesian space.
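The foldable vs. non-foldable distinction can be illustrated with a short sketch (column and table names are hypothetical; we assume that a foldable SRID argument yields the corresponding concretely parametrized type):
```
-- Foldable (constant) SRID argument: the SRID is known at analysis time,
-- so the result can have the concrete type GEOMETRY(4326).
SELECT ST_GeomFromWKB(wkb_col, 4326) FROM src;

-- Non-foldable SRID argument: the SRID may differ per row,
-- so the result type must be GEOMETRY(ANY).
SELECT ST_GeomFromWKB(wkb_col, srid_col) FROM src;
```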
h4. _In-memory representation_
We propose to internally represent geospatial values as a byte array containing
the concatenation of the BINARY value and an INTEGER value. The BINARY value is
the WKB representation of the GEOMETRY or GEOGRAPHY value. The INTEGER value is
the SRID value. WKB stands for Well-Known Binary and it is one of the standard
formats for representing geospatial values (see
[https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary]).
WKB is also the format used by the latest Parquet and Iceberg specs for
representing geospatial values.
h4. _Serialization_
We propose to serialize only GEOMETRY and GEOGRAPHY columns that have a numeric
SRID value. For such columns the SRID value can be recovered from the column
metadata. The serialization format for values will be WKB.
h4. _Casting_
Implicit casting from GEOMETRY(<srid>) to GEOMETRY(ANY) and GEOGRAPHY(<srid>)
to GEOGRAPHY(ANY) is primarily introduced in order to smoothly support
expressions like CASE/WHEN, COALESCE, NVL2, etc., that depend on least common
type logic.
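A minimal sketch of this least-common-type behavior, assuming hypothetical columns geom0 of type GEOMETRY(0) and geom4326 of type GEOMETRY(4326):
```
-- Both branches are implicitly cast to GEOMETRY(ANY), the least common type,
-- so the COALESCE result has type GEOMETRY(ANY).
SELECT COALESCE(geom0, geom4326) FROM geo_tbl;

-- The same least-common-type logic applies to CASE/WHEN (cond is a
-- hypothetical boolean column):
SELECT CASE WHEN cond THEN geom0 ELSE geom4326 END FROM geo_tbl;
```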
h4. _Dependencies_
We propose to add a build dependency on the PROJ library
([https://proj.org/en/stable]) for generating a list of supported CRS
definitions. We will need to expand this list with SRID 0, which is necessary
when dealing with geometry inputs with an unspecified SRID.