Menelaos Karavelas created SPARK-51658:
------------------------------------------
Summary: SPIP: Add geospatial types in Spark
Key: SPARK-51658
URL: https://issues.apache.org/jira/browse/SPARK-51658
Project: Spark
Issue Type: New Feature
Components: SQL
Affects Versions: 4.1.0
Reporter: Menelaos Karavelas
*Q1. What are you trying to do? Articulate your objectives using absolutely no
jargon.*
Add two new data types to Spark SQL and PySpark, GEOMETRY and GEOGRAPHY, for
handling location and shape data, specifically:
# GEOMETRY: For working with shapes on a Cartesian space (a flat surface, like
a paper map)
# GEOGRAPHY: For working with shapes on an ellipsoidal surface (like the
Earth's surface)
These will let Spark users work with standard shape types like:
* POINTS, MULTIPOINTS
* LINESTRINGS, MULTILINESTRINGS
* POLYGONS, MULTIPOLYGONS
* GEOMETRY COLLECTIONS
These are [standard shape
types|https://portal.ogc.org/files/?artifact_id=25355] as defined by the [Open
Geospatial Consortium (OGC)|https://www.ogc.org/], the international standards
body for Geographic Information Systems (GIS).
The new SQL types will be parametrized with an integer value (a Spatial
Reference IDentifier, or SRID) that defines the underlying coordinate reference
system in which geospatial values live (like GPS coordinates vs. local
measurements).
Popular data storage formats
([Parquet|https://github.com/apache/parquet-format/blob/master/Geospatial.md]
and [Iceberg|https://github.com/apache/iceberg/pull/10981]) are adding support
for GEOMETRY and GEOGRAPHY, and the proposal is to extend Spark so that it can
read, write, and operate on the new geospatial types.
*Q2. What problem is this proposal NOT designed to solve?*
This proposal is not designed to build a comprehensive system for advanced
geospatial analysis in Spark. While we're adding basic support for storing and
reading location/shape data, we are not:
* Creating complex geospatial processing functions
* Building spatial analysis tools
* Developing geometric or geographic calculations and transformations
This proposal lays the foundation, building the infrastructure to handle
geospatial data, but does not create a full-featured geospatial processing
system. Such extensions can be added later as separate improvements.
*Q3. How is it done today, and what are the limits of current practice?*
Current State:
# Users store geospatial data by converting it into basic text (string) or
binary formats because Spark doesn't understand geospatial data directly.
# To work with this data, users need to use external tools to make sense of
and process the data.
Limitations (see the sketch after this list):
* Users have to rely on external libraries
* Users have to convert data back and forth
* Geospatial values cannot be written or read in a native way
* External tools cannot natively understand the data as geospatial
* No file or data skipping
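As a minimal sketch of the status quo (the table and column names here are hypothetical), geospatial data today has to be modeled with plain Spark SQL types, so Spark is unaware of its meaning:
```
-- Today's workaround (sketch): geometry stored as raw WKB bytes, with the
-- SRID tracked manually in a separate column. Spark sees only BINARY and INT,
-- so it cannot validate the data, skip files, or treat the bytes as geospatial.
CREATE TABLE legacy_geo_tbl (geom_wkb BINARY, srid INT);
```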
*Q4. What is new in your approach and why do you think it will be successful?*
The high-level approach is not new, and we have a clear picture of how to split
the work into sub-tasks based on our experience adding new types such as ANSI
intervals and TIMESTAMP_NTZ.
Most existing data processing systems already support working with geospatial
data.
*Q5. Who cares? If you are successful, what difference will it make?*
The new types create the foundation for working with geospatial data in Spark.
Other data analytics systems, such as PostgreSQL, Redshift, Snowflake, and
BigQuery, all have geospatial support.
Spark stays relevant and compatible with the latest Parquet and Iceberg
community developments.
*Q6. What are the risks?*
The addition of the new data types will allow for strictly new workloads, so
the risks in this direction are minimal. The only overlap with existing
functionality is at the type system level (e.g., casts between the new
geospatial types and the existing Spark SQL types). The risk is low, however,
and can be handled through testing.
*Q7. How long will it take?*
In total it might take around 9 months. The estimate is based on similar
tasks: ANSI intervals (SPARK-27790) and TIMESTAMP_NTZ (SPARK-35662). We can
split the work into functional blocks:
# Base functionality - 2 weeks
Add the new geospatial types, literals, type constructors, and external types.
# Persistence - 2.5 months
Ability to create Parquet tables with GEOMETRY and GEOGRAPHY columns, read/write
from/to Parquet and other built-in data sources, stats, predicate push down.
# Basic data skipping operators - 1 month
Implement rudimentary operators for data skipping (e.g., ST_BoxesIntersect,
which checks whether the bounding boxes of two geospatial values intersect).
# Client support - 1 month
JDBC, Hive, Thrift server, Spark Connect
# PySpark integration - 1 month
DataFrame support, pandas API, Python UDFs, Arrow column vectors
# Docs + testing/benchmarking - 1 month
*Q8. What are the mid-term and final “exams” to check for success?*
Mid-term criteria: new type definitions and the ability to read/write
geospatial data from/to Parquet.
Final criteria: functionality equivalent to that of any other scalar data type.
*Appendix A. Proposed API Changes.*
_Geospatial data types_
We propose to support two new parametrized data types: GEOMETRY and GEOGRAPHY.
Both take as parameter either a non-negative integer value called an SRID
(Spatial Reference IDentifier) or the special specifier ANY. See Appendix B for
more details on the type system.
Examples:
* GEOMETRY(0) - materializable; every row has SRID 0
* GEOMETRY(ANY) - not materializable; every row can have a different SRID
* GEOGRAPHY(4326) - materializable; every row has SRID 4326
* GEOGRAPHY(ANY) - not materializable; every row can have a different SRID
GEOMETRY(ANY) and GEOGRAPHY(ANY) act as least common types among the
parametrized GEOMETRY and GEOGRAPHY types.
h4. _Table creation_
Users will be able to create GEOMETRY(<srid>) or GEOGRAPHY(<srid>) columns:
```
CREATE TABLE tbl (geom GEOMETRY(0), geog GEOGRAPHY(4326));
```
Users will not be able to create GEOMETRY(ANY) or GEOGRAPHY(ANY) columns. See
details in Appendix B.
Users will be able to insert geospatial values into such columns:
```
INSERT INTO tbl
VALUES (X'0101000000000000000000f03f0000000000000040'::GEOMETRY(0),
        X'0101000000000000000000f03f0000000000000040'::GEOGRAPHY(4326));
```
h4. _Casts_
The following explicit casts are allowed (all other casts are initially
disallowed):
* GEOMETRY/GEOGRAPHY column to BINARY (output is WKB format)
* BINARY to GEOMETRY/GEOGRAPHY column (input is expected to be in WKB format)
* GEOMETRY(ANY) to GEOMETRY(<srid>)
* GEOGRAPHY(ANY) to GEOGRAPHY(<srid>)
The following implicit casts are allowed (also supported as explicit); see the
sketch after this list:
* GEOMETRY(<srid>) to GEOMETRY(ANY)
* GEOGRAPHY(<srid>) to GEOGRAPHY(ANY)
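A minimal sketch of how these casts could look in SQL, reusing the `::` cast syntax and the hypothetical table `tbl` from the earlier examples:
```
-- Explicit cast: GEOMETRY column to BINARY; the output is in WKB format.
SELECT geom::BINARY FROM tbl;

-- Explicit cast: a BINARY WKB literal to GEOGRAPHY with a fixed SRID.
SELECT X'0101000000000000000000f03f0000000000000040'::GEOGRAPHY(4326);

-- Implicit cast, written explicitly here: GEOMETRY(0) to GEOMETRY(ANY).
SELECT geom::GEOMETRY(ANY) FROM tbl;
```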
h4. _Geospatial expressions_
The following geospatial expressions are to be exposed to users (a usage sketch
follows the list):
* Scalar expressions
** Boolean ST_BoxesIntersect(BoxStruct1, BoxStruct2)
** BoxStruct ST_Box2D(GeoVal)
** BinaryVal ST_AsBinary(GeoVal)
** GeoVal ST_GeomFromWKB(BinaryVal, IntVal)
** GeoVal ST_GeogFromWKB(BinaryVal, IntVal)
* Aggregate expressions
** BoxStruct ST_Extent(GeoCol)
BoxStruct above refers to a struct of 4 double values.
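A hedged usage sketch (the tables and columns are hypothetical; the signatures are the ones listed above):
```
-- Scalar: check whether the bounding boxes of two geometries intersect.
SELECT ST_BoxesIntersect(ST_Box2D(a.geom), ST_Box2D(b.geom))
FROM tbl_a a JOIN tbl_b b ON a.id = b.id;

-- Scalar: construct a GEOMETRY value from WKB bytes and an SRID.
SELECT ST_GeomFromWKB(X'0101000000000000000000f03f0000000000000040', 0);

-- Aggregate: the overall bounding box (a struct of 4 doubles) of a column.
SELECT ST_Extent(geom) FROM tbl;
```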
h4. _Ecosystem_
Support for the SQL, DataFrame, and PySpark APIs, including the Spark Connect
APIs.
*Appendix B.*
h4. _Type System_
We propose to introduce two parametrized types: GEOMETRY and GEOGRAPHY. The
parameter would either be a non-negative integer value or the special specifier
ANY.
* The integer value represents a Spatial Reference IDentifier (SRID) which
uniquely defines the coordinate reference system. These integers will be mapped
to coordinate reference system (CRS) definitions defined by authorities like
OGC, EPSG, or ESRI.
* The ANY specifier refers to GEOMETRY or GEOGRAPHY columns where the SRID
value can be different across the column rows. GEOMETRY(ANY) and GEOGRAPHY(ANY)
will not be materialized. This constraint is imposed by the storage
specifications (the Parquet and Iceberg specs require a unique SRID value per
GEOMETRY or GEOGRAPHY column).
* GEOMETRY(ANY) and GEOGRAPHY(ANY) act as the least common type across all
parametrized GEOMETRY and GEOGRAPHY types. They are also needed for the
proposed ST_GeomFromWKB and ST_GeogFromWKB expressions when the second argument
is not foldable; for non-foldable arguments the return type of these
expressions is GEOMETRY(ANY) and GEOGRAPHY(ANY), respectively (see the sketch
after this list).
* All known SRID values are positive integers. For the GEOMETRY data type we
will also allow the SRID value of 0. This is the typical way of specifying
geometric values with unknown or unspecified underlying CRS. Mathematically
these values are understood as embedded in a Cartesian space.
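The foldable vs. non-foldable distinction can be illustrated with a short sketch (column and table names are hypothetical; we assume that a foldable SRID argument yields the corresponding concretely parametrized type):
```
-- Foldable (constant) SRID argument: the SRID is known at analysis time,
-- so the result can have the concrete type GEOMETRY(4326).
SELECT ST_GeomFromWKB(wkb_col, 4326) FROM src;

-- Non-foldable SRID argument: the SRID may differ per row,
-- so the result type must be GEOMETRY(ANY).
SELECT ST_GeomFromWKB(wkb_col, srid_col) FROM src;
```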
h4. _In-memory representation_
We propose to internally represent geospatial values as a byte array containing
the concatenation of the BINARY value and an INTEGER value. The BINARY value is
the WKB representation of the GEOMETRY or GEOGRAPHY value. The INTEGER value is
the SRID value. WKB stands for Well-Known Binary and it is one of the standard
formats for representing geospatial values (see
[https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary]).
WKB is also the format used by the latest Parquet and Iceberg specs for
representing geospatial values.
h4. _Serialization_
We propose to serialize only GEOMETRY and GEOGRAPHY columns that have a numeric
SRID value. For such columns the SRID value can be recovered from the column
metadata. The serialization format for values will be WKB.
h4. _Casting_
Implicit casting from GEOMETRY(<srid>) to GEOMETRY(ANY) and GEOGRAPHY(<srid>)
to GEOGRAPHY(ANY) is primarily introduced in order to smoothly support
expressions like CASE/WHEN, COALESCE, NVL2, etc., that depend on least common
type logic.
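A minimal sketch of this least-common-type behavior, assuming hypothetical columns geom0 of type GEOMETRY(0) and geom4326 of type GEOMETRY(4326):
```
-- Both branches are implicitly cast to GEOMETRY(ANY), the least common type,
-- so the COALESCE result has type GEOMETRY(ANY).
SELECT COALESCE(geom0, geom4326) FROM geo_tbl;

-- The same least-common-type logic applies to CASE/WHEN (cond is a
-- hypothetical boolean column):
SELECT CASE WHEN cond THEN geom0 ELSE geom4326 END FROM geo_tbl;
```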
h4. _Dependencies_
We propose to add a build dependency on the PROJ library
([https://proj.org/en/stable]) for generating a list of supported CRS
definitions. We will need to expand this list with SRID 0, which is necessary
when dealing with geometry inputs with an unspecified SRID.