Hey Angel, I am glad that you asked these questions. Please see my answers below.
*1. Domain types evolve quickly. - It has taken years for Parquet to include these new types in its format... We could evolve alongside Parquet. Unfortunately, Spark is not known for upgrading its dependencies quickly.*

Exactly — domain-specific types evolve rapidly and may head in directions that aren’t fully aligned with formats like Parquet, Avro, and others. In such cases, should Spark, as a general-purpose compute engine, really be tightly coupled to the specifics of a single storage format? Personally, I really appreciate Spark’s UserDefinedType mechanism and Apache Arrow’s ExtensionType — both offer maximum flexibility while keeping the core engine clean and extensible (see the sketches after my answers below).

*2. Geospatial in Java and Python is a dependency hell. - How has Parquet solved that problem, then?*

Exactly — this problem is not fully solved by Parquet. While the Parquet spec now includes a definition for geospatial types, it’s more of a vision than a complete, production-ready solution, and many aspects of the spec are not yet implemented in Spark. In fact, the spec represents a compromise among multiple vendors (e.g., BigQuery, Snowflake), and many design choices are not aligned with Spark’s architecture or ecosystem. For example:

• The CRS property in the spec uses a PROJJSON string, which currently has only a C++ implementation — there is no Java implementation available.
• The edge interpolation algorithms (e.g., for great-circle arcs) mentioned in the spec likewise exist only in C++ libraries.
• Handling of antimeridian-crossing geometries is another complex topic that isn’t addressed in Spark today.

The Sedona community is actively working on solutions — either building Java equivalents of these features or creating workarounds. These are deeply domain-specific efforts that often require non-trivial geospatial expertise. We are currently contributing a Java implementation of the Parquet Geo format here: https://github.com/apache/parquet-java/pull/2971

In Python, geospatial manipulation depends on libraries like Shapely and GeoPandas, which evolve quickly and frequently introduce breaking changes. Sedona has invested significant effort in maintaining compatibility and stability for Python UDFs across these ecosystems.

If you haven’t encountered this kind of “dependency hell” while working on geospatial projects with Spark, you may have been fortunate enough to deal with relatively simple cases — e.g., only point data or simple polygons. That usually means:

1. All geometries are in a single CRS, typically WGS84 (SRID 4326)
2. No antimeridian-crossing geometries
3. No need for high-precision distance calculations or spherical geometry
4. No need to handle topology or wraparound issues

If that’s the case, then Spark already works fine as-is for your use case — so why complicate it?

*3. Sedona already supports Geo fully in (Geo)Parquet. - The default format in Spark is Parquet, and Parquet now natively supports these types. Are we going to force users to add Sedona?*

While opinions may vary, I would encourage users to adopt a solution like Apache Sedona that is laser-focused on geospatial. Sedona provides comprehensive, step-by-step tutorials on handling geospatial dependencies across major platforms — including Databricks, AWS EMR, Microsoft Azure, and Google Cloud. We’re also actively collaborating with cloud providers to bundle Sedona natively into their offerings, making it even easier for users to get started.
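To make my point 1 concrete, here is the kind of extension I mean by the UserDefinedType mechanism. This is a minimal illustrative sketch, not Sedona's actual implementation: it assumes JTS for the in-memory geometry class and WKB for the stored encoding, and the package name is hypothetical (Spark's UDT API is non-public since 2.0, which is why real implementations such as Sedona's GeometryUDT must live under the org.apache.spark.sql package tree):

```scala
// Illustrative sketch only. Assumes JTS (org.locationtech.jts) on the
// classpath; the package name is hypothetical, but it must sit under
// org.apache.spark.sql because the UDT API is private[spark].
package org.apache.spark.sql.geo_sketch

import org.apache.spark.sql.types.{BinaryType, DataType, UserDefinedType}
import org.locationtech.jts.geom.Geometry
import org.locationtech.jts.io.{WKBReader, WKBWriter}

class GeometryUDT extends UserDefinedType[Geometry] {
  // Catalyst stores the value as an opaque WKB blob; the engine core
  // never needs to know it is a geometry.
  override def sqlType: DataType = BinaryType

  override def serialize(obj: Geometry): Array[Byte] =
    new WKBWriter().write(obj)

  override def deserialize(datum: Any): Geometry = datum match {
    case bytes: Array[Byte] => new WKBReader().read(bytes)
    case other => throw new IllegalArgumentException(s"Cannot deserialize: $other")
  }

  override def userClass: Class[Geometry] = classOf[Geometry]
}
```

Because the engine only ever sees BinaryType, all the domain logic (CRS handling, edge interpolation, antimeridian fixes, and so on) stays in the extension library and can evolve at its own pace.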
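And on point 3, this is roughly what adopting Sedona looks like in practice today for reading GeoParquet. A sketch, assuming the Sedona Spark package is on the classpath; the input path is a placeholder:

```scala
import org.apache.sedona.spark.SedonaContext

// Register Sedona's geometry types and ST_* functions on top of a
// plain SparkSession.
val sedona = SedonaContext.create(
  SedonaContext.builder().master("local[*]").getOrCreate())

// Sedona's GeoParquet DataSource handles geometry encoding/decoding
// and metadata; no Spark core changes are involved.
val df = sedona.read.format("geoparquet").load("/path/to/example.parquet")
df.printSchema() // the geometry column surfaces as Sedona's GeometryUDT
```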
That said, I generally share the same perspective — if the Spark community believes it would benefit from having basic geospatial support built in, the Sedona community would be happy to collaborate on this effort. We’re open to contributing the necessary functionality and, if appropriate, having Spark depend on Sedona directly to avoid reinvention.

Thanks,
Jia

On Sat, Mar 29, 2025 at 11:02 PM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:

> *1. Domain types evolve quickly.*
> It has taken years for Parquet to include these new types in its format... We could evolve alongside Parquet. Unfortunately, Spark is not known for upgrading its dependencies quickly.
>
> *2. Geospatial in Java and Python is a dependency hell.*
> How has Parquet solved that problem, then? I don't recall experiencing any "dependency hell" when working on geospatial projects with Spark, to be honest. Besides, Spark already includes Parquet as a dependency, so... where is the problem?
>
> *3. Sedona already supports Geo fully in (Geo)Parquet.*
> The default format in Spark is Parquet, and Parquet now natively supports these types. Are we going to force users to add Sedona (along with all its third-party dependencies, I assume) to their projects just for reading, writing, and performing basic operations with these types?
>
> Anyway, let's vote and see...
>
> On Sat, Mar 29, 2025 at 22:41, Reynold Xin (<r...@databricks.com.invalid>) wrote:
>
>> While I don’t think Spark should become a super-specialized geospatial processing engine, I don’t think it makes sense to focus *only* on reading and writing from storage. Geospatial is a pretty common and fundamental capability of analytics systems, and virtually every mature and popular analytics system, be it open source or proprietary, storage or query, has some basic geospatial data type and support. Adding a geospatial type and some basic expressions is such a no-brainer.
>>
>> On Sat, Mar 29, 2025 at 2:27 PM Jia Yu <ji...@apache.org> wrote:
>>
>>> Hi Wenchen, Menelaos and Szehon,
>>>
>>> Thanks for the clarification — I’m glad to hear the primary motivation of this SPIP is focused on reading and writing geospatial data with Parquet and Iceberg. That’s an important goal, and I want to highlight that this problem is being solved by the Apache Sedona community.
>>>
>>> Since the primary motivation here is Parquet-level support, I suggest shifting the focus of this discussion toward enabling geo support in the Spark Parquet DataSource rather than introducing core types.
>>>
>>> ** Why Spark Should Avoid Hardcoding Domain-Specific Types like Geo Types **
>>>
>>> 1. Domain types evolve quickly.
>>>
>>> In geospatial, we already have geometry, geography, raster, trajectory, point clouds — and the list keeps growing. In AI/ML, we’re seeing tensors, vectors, and multi-dimensional arrays. Spark’s strength has always been in its general-purpose architecture and extensibility. Introducing hardcoded support for fast-changing domain-specific types risks long-term maintenance issues and eventual incompatibility with emerging standards.
>>>
>>> 2. Geospatial in Java and Python is a dependency hell.
>>>
>>> There are multiple competing geometry libraries with incompatible APIs. No widely adopted Java library supports geography types. The most authoritative CRS dataset (EPSG) is not Apache-compatible.
>>> The JSON format for CRS definitions (PROJJSON) is only fully supported in PROJ, a C++ library with no Java equivalent and no formal OGC standard status. On the Python side, this might involve Shapely and GeoPandas dependencies.
>>>
>>> 3. Sedona already supports Geo fully in (Geo)Parquet.
>>>
>>> Sedona has supported reading, writing, metadata preservation, and data skipping for GeoParquet (the predecessor of Parquet Geo) for over two years [2][3]. These features are production-tested and widely used.
>>>
>>> ** Proposed Path Forward: Geo Support via Spark Extensions **
>>>
>>> To enable seamless Parquet integration without burdening Spark core, here are two options:
>>>
>>> Option 1:
>>> Sedona offers a dedicated `parquet-geo` DataSource that handles type encoding, metadata, and data skipping. No changes to Spark are required. This is already underway and will be maintained by the Sedona community to keep up with the evolving Geo standards.
>>>
>>> Option 2:
>>> Spark provides hooks to inject:
>>> - custom logical types / user-defined types (UDTs)
>>> - custom statistics and filter pushdowns
>>> Sedona can then extend the built-in `parquet` DataSource to integrate geo type metadata, predicate pushdown, and serialization seamlessly.
>>>
>>> For Iceberg, we’ve already published a proof-of-concept connector [4] showing Sedona, Spark, and Iceberg working together without any Spark core changes [5].
>>>
>>> ** On the Bigger Picture **
>>>
>>> I also agree with your long-term vision. I believe Spark is on the path to becoming a foundational compute engine — much like Postgres or Pandas — where the core remains focused and stable, while powerful domain-specific capabilities emerge from its ecosystem.
>>>
>>> To support this future, Spark could prioritize flexible extension hooks so that third-party libraries can thrive — just like we’ve seen with PostGIS, pgvector, and TimescaleDB in the Postgres ecosystem, and GeoPandas in the Pandas ecosystem.
>>>
>>> Sedona is following this model by building geospatial support around Spark — not inside it — and we’d love to continue collaborating in this spirit.
>>>
>>> Happy to work together on providing Geo support in Parquet!
>>>
>>> Best,
>>> Jia
>>>
>>> References
>>>
>>> [1] GeoParquet project: https://github.com/opengeospatial/geoparquet
>>>
>>> [2] Sedona’s GeoParquet DataSource implementation: https://github.com/apache/sedona/tree/master/spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet
>>>
>>> [3] Sedona’s GeoParquet documentation: https://sedona.apache.org/latest/tutorial/files/geoparquet-sedona-spark/
>>>
>>> [4] Sedona-Iceberg connector (PoC): https://github.com/wherobots/sedona-iceberg-connector
>>>
>>> [5] Spark-Sedona-Iceberg working example: https://github.com/wherobots/sedona-iceberg-connector/blob/main/src/test/scala/com/wherobots/sedona/TestGeospatial.scala#L53
>>>
>>> On 2025/03/29 19:27:08 Menelaos Karavelas wrote:
>>> > To continue along the line of thought of Szehon:
>>> >
>>> > I am really excited that the Parquet and Iceberg communities have adopted geospatial logical types, and of course I am grateful for the work put in that direction.
>>> >
>>> > As both Wenchen and Szehon pointed out in their own way, the goal is to have minimal support in Spark, as a common platform, for these types.
>>> >
>>> > To be more specific and explicit: the proposal's scope is to add support for reading/writing to Parquet, based on the new standard, as well as adding the types as built-in types in Spark to complement the storage support. The few ST expressions in the proposal seem to be the minimal set of expressions needed to support working with geospatial values in the Spark engine in a meaningful way.
>>> >
>>> > Best,
>>> >
>>> > Menelaos
>>> >
>>> >
>>> > > On Mar 29, 2025, at 12:06 PM, Szehon Ho <szehon.apa...@gmail.com> wrote:
>>> > >
>>> > > Thank you Menelaos, will do!
>>> > >
>>> > > To give a little background, Jia and the Sedona community, as well as the GeoParquet community and others, really put much effort into defining the Parquet and Iceberg geo types, which couldn't have been done without their experience and help!
>>> > >
>>> > > But I do agree with Wenchen: now that the types are in most common data sources in the ecosystem, I think Apache Spark as a common platform needs to have this type definition for interop; otherwise users of vanilla Spark cannot work with those data sources' stored geospatial data. (IMO, a similar rationale applies to adding timestamp nano in the other ongoing SPIP.)
>>> > >
>>> > > And like Wenchen said, the SPIP's goal doesn't seem to be to fragment the ecosystem by implementing Sedona's advanced geospatial analytics tech in Spark itself, which you may be right belongs in pluggable frameworks. Menelaos may explain more about the SPIP goal.
>>> > >
>>> > > I do hope there can be more collaboration across communities (like the Iceberg/Parquet collaboration) in drawing on the Sedona community's experience to make sure these type definitions are optimal and compatible with Sedona.
>>> > >
>>> > > Thanks!
>>> > > Szehon
>>> > >
>>> > >
>>> > >> On Mar 29, 2025, at 8:04 AM, Menelaos Karavelas <menelaos.karave...@gmail.com> wrote:
>>> > >>
>>> > >> Hello Szehon,
>>> > >>
>>> > >> I just created a Google doc and also linked it in the JIRA:
>>> > >>
>>> > >> https://docs.google.com/document/d/1cYSNPGh95OjnpS0k_KDHGM9Ae3j-_0Wnc_eGBZL4D3w/edit?tab=t.0
>>> > >>
>>> > >> Please feel free to comment on it.
>>> > >>
>>> > >> Best,
>>> > >>
>>> > >> Menelaos
>>> > >>
>>> > >>
>>> > >>> On Mar 28, 2025, at 2:19 PM, Szehon Ho <szehon.apa...@gmail.com> wrote:
>>> > >>>
>>> > >>> Thanks Menelaos, this is exciting! Is there a Google doc we can comment on, or just the JIRA?
>>> > >>>
>>> > >>> Thanks,
>>> > >>> Szehon
>>> > >>>
>>> > >>> On Fri, Mar 28, 2025 at 1:41 PM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>>> > >>>> Sorry, I only had a quick look at the proposal, looked for WKT, and didn't find anything.
>>> > >>>>
>>> > >>>> It's been years since I worked on geospatial projects and I'm not an expert (at all). Maybe start with something simple but useful, like WKT <=> WKB conversion?
>>> > >>>>
>>> > >>>>
>>> > >>>> On Fri, Mar 28, 2025 at 21:27, Menelaos Karavelas <menelaos.karave...@gmail.com> wrote:
>>> > >>>>> In the SPIP JIRA the proposal is to add the expressions ST_AsBinary, ST_GeomFromWKB, and ST_GeogFromWKB.
>>> > >>>>> Is there anything else that you think should be added?
>>> > >>>>>
>>> > >>>>> Regarding WKT, what do you think should be added?
>>> > >>>>>
>>> > >>>>> - Menelaos
>>> > >>>>>
>>> > >>>>>
>>> > >>>>>> On Mar 28, 2025, at 1:02 PM, Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>>> > >>>>>>
>>> > >>>>>> What about adding support for WKT <https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry> / WKB <https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary>?
>>> > >>>>>>
>>> > >>>>>> On Fri, Mar 28, 2025 at 20:50, Ángel Álvarez Pascua (<angel.alvarez.pas...@gmail.com>) wrote:
>>> > >>>>>>> +1 (non-binding)
>>> > >>>>>>>
>>> > >>>>>>> On Fri, Mar 28, 2025 at 18:48, Menelaos Karavelas <menelaos.karave...@gmail.com> wrote:
>>> > >>>>>>>> Dear Spark community,
>>> > >>>>>>>>
>>> > >>>>>>>> I would like to propose the addition of new geospatial data types (GEOMETRY and GEOGRAPHY), which represent geospatial values and were recently added as new logical types in the Parquet specification.
>>> > >>>>>>>>
>>> > >>>>>>>> The new types should improve Spark's ability to read the new Parquet logical types and perform some minimal meaningful operations on them.
>>> > >>>>>>>>
>>> > >>>>>>>> SPIP: https://issues.apache.org/jira/browse/SPARK-51658
>>> > >>>>>>>>
>>> > >>>>>>>> Looking forward to your comments and feedback.
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>> Best regards,
>>> > >>>>>>>>
>>> > >>>>>>>> Menelaos Karavelas