[
https://issues.apache.org/jira/browse/SPARK-51658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-51658:
----------------------------------
Fix Version/s: (was: 4.1.0)
> SPIP: Add geospatial types in Spark
> -----------------------------------
>
> Key: SPARK-51658
> URL: https://issues.apache.org/jira/browse/SPARK-51658
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Affects Versions: 4.1.0
> Reporter: Menelaos Karavelas
> Priority: Major
> Labels: SPIP, releasenotes
>
> *Q1. What are you trying to do? Articulate your objectives using absolutely
> no jargon.*
> Add two new data types to Spark SQL and PySpark, GEOMETRY and GEOGRAPHY, for
> handling location and shape data, specifically:
> # GEOMETRY: For working with shapes in a Cartesian space (a flat surface,
> like a paper map)
> # GEOGRAPHY: For working with shapes on an ellipsoidal surface (like the
> Earth's surface)
> These will let Spark users work with standard shape types like:
> * POINTS, MULTIPOINTS
> * LINESTRINGS, MULTILINESTRINGS
> * POLYGONS, MULTIPOLYGONS
> * GEOMETRY COLLECTIONS
> These are [standard shape
> types|https://portal.ogc.org/files/?artifact_id=25355] as defined by the
> [Open Geospatial Consortium (OGC)|https://www.ogc.org/]. OGC is the
> international governing body for standardizing Geographic Information Systems
> (GIS).
> The new SQL types will be parametrized with an integer value (a Spatial
> Reference IDentifier, or SRID) that identifies the coordinate reference
> system in which geospatial values are expressed (like GPS coordinates vs.
> local measurements).
> Popular data storage formats
> ([Parquet|https://github.com/apache/parquet-format/blob/master/Geospatial.md]
> and [Iceberg|https://github.com/apache/iceberg/pull/10981]) are adding
> support for GEOMETRY and GEOGRAPHY, and the proposal is to extend Spark so
> that it can read, write, and operate on the new geospatial types.
>
> *Q2. What problem is this proposal NOT designed to solve?*
> Build a comprehensive system for advanced geospatial analysis in Spark. While
> we're adding basic support for storing and reading location/shape data, we
> are not:
> * Creating complex geospatial processing functions
> * Building spatial analysis tools
> * Developing geometric or geographic calculations and transformations
> This proposal is laying the foundation - building the infrastructure to
> handle geospatial data, but not creating a full-featured geospatial
> processing system. Such extensions can be built in existing frameworks like
> Apache Sedona (a well-known Apache project that extends Spark to add support
> for geospatial processing), or added later as a separate improvement.
>
> *Q3. How is it done today, and what are the limits of current practice?*
> Current State:
> # Users store geospatial data by converting it into basic text (string) or
> binary formats because Spark doesn't understand geospatial data directly.
> # To work with this data, users need to use external tools to make sense of
> and process the data.
> Limitations:
> * Users must rely on external libraries
> * Users must convert data back and forth between formats
> * Geospatial values cannot be read or written natively
> * External tools cannot natively recognize the data as geospatial
> * No file or data skipping is possible
>
> *Q4. What is new in your approach and why do you think it will be successful?*
> The high-level approach is not new, and we have a clear picture of how to
> split the work by sub-tasks based on our experience of adding new types such
> as ANSI intervals and TIMESTAMP_NTZ.
> Most existing data processing systems already support working with
> geospatial data.
>
> *Q5. Who cares? If you are successful, what difference will it make?*
> New types create the foundation for working with geospatial data on Spark.
> Other data analytics systems, such as PostgreSQL, Redshift, Snowflake, and
> BigQuery, all have geospatial support.
> Spark stays relevant and compatible with the latest Parquet and Iceberg
> community developments. Projects like Apache Sedona can capitalize on the
> work proposed in this document to benefit from built-in types and storage
> support.
>
> *Q6. What are the risks?*
> The addition of the new data types will allow for strictly new workloads, so
> the risks in this direction are minimal. The only overlap with existing
> functionality is at the type system level (e.g., casts between the new
> geospatial types and the existing Spark SQL types). The risk is low,
> however, and can be handled through testing.
>
> *Q7. How long will it take?*
> In total it might take around 9 months. The estimation is based on similar
> tasks: ANSI intervals (SPARK-27790) and TIMESTAMP_NTZ (SPARK-35662). We can
> split the work into functional blocks:
> # Base functionality - 2 weeks
> Add the new geospatial types, literals, type constructors, and external types.
> # Persistence - 2.5 months
> Ability to create Parquet tables with geospatial columns, read/write from/to
> Parquet and other built-in data sources, stats, predicate push down.
> # Basic data skipping operators - 1 month
> Implement rudimentary operators for data skipping (e.g., ST_BoxesIntersect,
> which checks whether the bounding boxes of two geospatial values intersect).
> # Clients support - 1 month
> JDBC, Hive, Thrift server, Spark Connect
> # PySpark integration - 1 month
> DataFrame support, pandas API, Python UDFs, Arrow column vectors
> # Docs + testing/benchmarking - 1 month
>
> *Q8. What are the mid-term and final “exams” to check for success?*
> Mid-term criteria: New type definitions and read/write geospatial data from
> Parquet.
> Final criteria: Equivalent functionality with any other scalar data type.
>
> *Appendix A. Proposed API Changes.*
> h4. _Geospatial data types_
> We propose to support two new parametrized data types: GEOMETRY and
> GEOGRAPHY. Both take as input a non-negative integer value called SRID
> (referring to a spatial reference identifier). See Appendix B for more
> details on the type system.
> Examples:
>
> ||Syntax||Comments||
> |GEOMETRY(0)|Materializable, every row has SRID 0|
> |GEOMETRY(ANY)|Not materializable, every row can have different SRID|
> |GEOGRAPHY(4326)|Materializable, every row has SRID 4326|
> |GEOGRAPHY(ANY)|Not materializable, every row can have different SRID|
> GEOMETRY(ANY) and GEOGRAPHY(ANY) act as least common types among the
> parametrized GEOMETRY and GEOGRAPHY types.
> h4. _Table creation_
> Users will be able to create GEOMETRY(<srid>) or GEOGRAPHY(<srid>) columns:
>
> {code:java}
> CREATE TABLE tbl (geom GEOMETRY(0), geog GEOGRAPHY(4326)); {code}
>
> Users will not be able to create GEOMETRY(ANY) or GEOGRAPHY(ANY) columns. See
> details in Appendix B.
> Users will be able to insert geospatial values to such columns:
>
> {code:java}
> INSERT INTO tbl
> VALUES(X'0101000000000000000000f03f0000000000000040'::GEOMETRY(0),
> X'0101000000000000000000f03f0000000000000040'::GEOGRAPHY(4326)); {code}
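> The hex literal above is the WKB encoding of the point (1, 2). As a hedged sketch (plain Python with the standard struct module, not part of the proposal), the layout of a 2-D WKB point can be decoded like this:

```python
import struct

def decode_wkb_point(hex_str: str):
    """Decode a 2-D WKB Point: 1 byte byte-order flag (0x01 = little-endian),
    a uint32 geometry type (1 = Point), then two IEEE-754 doubles (x, y)."""
    raw = bytes.fromhex(hex_str)
    byte_order = "<" if raw[0] == 1 else ">"
    geom_type, = struct.unpack_from(byte_order + "I", raw, 1)
    assert geom_type == 1, "only Point geometries are handled in this sketch"
    x, y = struct.unpack_from(byte_order + "dd", raw, 5)
    return x, y

print(decode_wkb_point("0101000000000000000000f03f0000000000000040"))  # (1.0, 2.0)
```

> The leading 0101000000 bytes are the byte-order flag plus the Point type tag; the two 8-byte payloads are the coordinates.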
>
> h4. _Casts_
> The following explicit casts are allowed (all other casts are initially
> disallowed):
> * GEOMETRY/GEOGRAPHY column to BINARY (output is WKB format)
> * BINARY to GEOMETRY/GEOGRAPHY column (input is expected to be in WKB format)
> * GEOMETRY(ANY) to GEOMETRY(<srid>)
> * GEOGRAPHY(ANY) to GEOGRAPHY(<srid>)
> The following implicit casts are allowed (also supported as explicit):
> * GEOMETRY(<srid>) to GEOMETRY(ANY)
> * GEOGRAPHY(<srid>) to GEOGRAPHY(ANY)
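> The cast matrix above can be modeled as a small sketch (a hypothetical illustration, not Spark code; types are modeled as (kind, srid) pairs, with srid=None standing for the ANY specifier):

```python
# Hypothetical model of the proposed cast rules; not actual Spark code.
# A type is (kind, srid): kind is "GEOMETRY", "GEOGRAPHY", or "BINARY",
# and srid is an int, or None to stand for the ANY specifier.
def cast_allowed(src, dst, implicit=False):
    src_kind, src_srid = src
    dst_kind, dst_srid = dst
    geo = ("GEOMETRY", "GEOGRAPHY")
    # Implicit (also valid as explicit): GEO(<srid>) -> GEO(ANY), same kind.
    if src_kind in geo and src_kind == dst_kind \
            and src_srid is not None and dst_srid is None:
        return True
    if implicit:
        return False
    # Explicit-only casts:
    if src_kind in geo and dst_kind == "BINARY":
        return True   # geospatial -> WKB bytes
    if src_kind == "BINARY" and dst_kind in geo:
        return True   # WKB bytes -> geospatial
    if src_kind in geo and src_kind == dst_kind \
            and src_srid is None and dst_srid is not None:
        return True   # GEO(ANY) -> GEO(<srid>)
    return False
```

> Note that casts between GEOMETRY and GEOGRAPHY, and any other cast not listed, fall through to False, matching the "initially disallowed" default.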
> h4. _Geospatial expressions_
> The following geospatial expressions are to be exposed to users:
> * Scalar expressions
> ** Boolean ST_BoxesIntersect(BoxStruct1, BoxStruct2)
> ** BoxStruct ST_Box2D(GeoVal)
> ** BinaryVal ST_AsBinary(GeoVal)
> ** GeoVal ST_GeomFromWKB(BinaryVal, IntVal)
> ** GeoVal ST_GeogFromWKB(BinaryVal, IntVal)
> * Aggregate expressions
> ** BoxStruct ST_Extent(GeoCol)
> BoxStruct above refers to a struct of 4 double values.
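> The bounding-box semantics behind ST_BoxesIntersect and ST_Extent can be sketched as follows (the Box field names are assumptions; the proposal only says BoxStruct is a struct of 4 doubles):

```python
from typing import NamedTuple

class Box(NamedTuple):
    """Stand-in for BoxStruct: an axis-aligned 2-D bounding box.
    Field names are illustrative assumptions."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

def boxes_intersect(a: Box, b: Box) -> bool:
    # Two axis-aligned boxes intersect iff they overlap on both axes
    # (cf. the proposed ST_BoxesIntersect scalar expression).
    return (a.xmin <= b.xmax and b.xmin <= a.xmax
            and a.ymin <= b.ymax and b.ymin <= a.ymax)

def extent(boxes):
    # Aggregate: the smallest box covering all inputs (cf. ST_Extent).
    return Box(min(b.xmin for b in boxes), min(b.ymin for b in boxes),
               max(b.xmax for b in boxes), max(b.ymax for b in boxes))
```

> This is exactly the kind of cheap test that enables file/data skipping: if a file's extent does not intersect the query's box, the file can be skipped without parsing any geometry.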
> h4. _Ecosystem_
> Support for the SQL, DataFrame, and PySpark APIs, including the Spark
> Connect APIs.
>
> *Appendix B.*
> h4. _Type System_
> We propose to introduce two parametrized types: GEOMETRY and GEOGRAPHY. The
> parameter would either be a non-negative integer value or the special
> specifier ANY.
> * The integer value represents a Spatial Reference IDentifier (SRID) which
> uniquely defines the coordinate reference system. These integers will be
> mapped to coordinate reference system (CRS) definitions defined by
> authorities like OGC, EPSG, or ESRI.
> * The ANY specifier refers to GEOMETRY or GEOGRAPHY columns where the SRID
> value can be different across the column rows. GEOMETRY(ANY) and
> GEOGRAPHY(ANY) will not be materialized. This constraint is imposed by the
> storage specifications (the Parquet and Iceberg specs require a unique SRID
> value per GEOMETRY or GEOGRAPHY column).
> * GEOMETRY(ANY) and GEOGRAPHY(ANY) act as the least common type across all
> GEOMETRY and GEOGRAPHY parametrized types. They are also needed for the
> proposed ST_GeomFromWKB and ST_GeogFromWKB expressions when the second
> argument is not foldable; for non-foldable arguments the return type for
> these expressions is GEOMETRY(ANY) and GEOGRAPHY(ANY), respectively.
> * All known SRID values are positive integers. For the GEOMETRY data type we
> will also allow the SRID value of 0. This is the typical way of specifying
> geometric values with unknown or unspecified underlying CRS. Mathematically
> these values are understood as embedded in a Cartesian space.
> h4. _In-memory representation_
> We propose to internally represent geospatial values as a byte array
> containing the concatenation of the BINARY value and an INTEGER value. The
> BINARY value is the WKB representation of the GEOMETRY or GEOGRAPHY value.
> The INTEGER value is the SRID value. WKB stands for Well-Known Binary and it
> is one of the standard formats for representing geospatial values (see
> [https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary]).
> WKB is also the format used by the latest Parquet and Iceberg specs for
> representing geospatial values.
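> A minimal sketch of this layout (the position and endianness of the SRID bytes are assumptions for illustration; the proposal states only that the WKB bytes and the SRID integer are concatenated into one byte array):

```python
import struct

def pack_geo(wkb: bytes, srid: int) -> bytes:
    # Concatenate the WKB payload with the SRID as a 4-byte integer.
    # Trailing position and little-endian encoding are assumptions.
    return wkb + struct.pack("<i", srid)

def unpack_geo(payload: bytes):
    # Split the combined byte array back into (WKB bytes, SRID).
    wkb, srid = payload[:-4], struct.unpack("<i", payload[-4:])[0]
    return wkb, srid
```

> Keeping the SRID alongside the WKB bytes lets a single array value carry everything needed to interpret the geometry, including for GEOMETRY(ANY) rows where the SRID varies per row.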
> h4. _Serialization_
> We propose to only serialize GEOMETRY and GEOGRAPHY columns that have a
> numeric SRID value. For such columns the SRID value can be recovered from
> the column metadata. The serialization format for values will be WKB.
> h4. _Casting_
> Implicit casting from GEOMETRY(<srid>) to GEOMETRY(ANY) and GEOGRAPHY(<srid>)
> to GEOGRAPHY(ANY) is primarily introduced in order to smoothly support
> expressions like CASE/WHEN, COALESCE, NVL2, etc., that depend on least common
> type logic.
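> A hedged sketch of the least-common-type logic that such expressions rely on (illustrative only, not Spark's analyzer code; types are modeled as (kind, srid) pairs, with srid=None meaning ANY):

```python
def least_common_type(a, b):
    """Resolve the least common type of two geospatial types, as needed by
    CASE/WHEN, COALESCE, NVL2, etc. A type is (kind, srid), srid=None = ANY."""
    kind_a, srid_a = a
    kind_b, srid_b = b
    if kind_a != kind_b:
        return None            # no common type across GEOMETRY/GEOGRAPHY
    if srid_a == srid_b:
        return a               # identical parametrized types
    return (kind_a, None)      # differing SRIDs widen to GEO(ANY)
```

> For example, COALESCE over a GEOMETRY(0) branch and a GEOMETRY(4326) branch would resolve to GEOMETRY(ANY), which is why the implicit widening cast is needed.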
> h4. _Dependencies_
> We propose to add a build dependency on the Proj library
> ([https://proj.org/en/stable]) for generating a list of supported CRS
> definitions. We will need to extend this list with SRID 0, which is
> necessary when dealing with geometry inputs with an unspecified SRID.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)