Hi Reynold and team,

I’m glad to see that the Spark community is recognizing the importance
of geospatial support. The Sedona community has long been a strong
advocate for Spark, and we’ve proudly supported large-scale geospatial
workloads on Spark for nearly a decade. We’re absolutely open to
collaborating and figuring out what’s best for the users together.

I’d like to give the community a bit more time to weigh in on the
scope of the SPIP — especially around the proposal’s current focus on
native types versus simply supporting reading/writing Parquet Geo
data.

While we wait, I took a quick look at the SPIP and noticed something a
bit surprising: as the most widely used geospatial framework in the
Spark ecosystem, Apache Sedona isn’t mentioned at all. Many of the
topics raised in the proposal — geometry types, geospatial
serialization, predicate pushdown, UDFs, and more — have already been
solved by Sedona, and are used at scale across Spark, Flink, and even
Snowflake.
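For readers less familiar with the geo side, here is a toy Python sketch of the idea behind geospatial predicate pushdown / data skipping. This is not Sedona's actual implementation; it only illustrates the core mechanism, where each Parquet row group carries bounding-box statistics and a spatial filter skips row groups whose box cannot intersect the query window:

```python
# Toy illustration of geo data skipping (NOT Sedona's implementation).
# Each row group carries bounding-box stats (xmin, ymin, xmax, ymax);
# a spatial filter skips any row group whose bbox misses the query window.

def bbox_intersects(a, b):
    """True if axis-aligned boxes (xmin, ymin, xmax, ymax) overlap."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def row_groups_to_scan(row_group_stats, query_bbox):
    """Return indices of row groups that may contain matching geometries."""
    return [i for i, bbox in enumerate(row_group_stats)
            if bbox_intersects(bbox, query_bbox)]

# Three row groups with hypothetical bbox stats; the query window
# overlaps only the first two, so the third is skipped entirely.
stats = [(0, 0, 10, 10), (5, 5, 20, 20), (100, 100, 110, 110)]
print(row_groups_to_scan(stats, (8, 8, 12, 12)))  # -> [0, 1]
```

The real implementations do this against Parquet column statistics and handle geometry types beyond boxes, but the skip decision reduces to this intersection test.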

I believe the SPIP JIRA ticket and the associated Google Doc would
benefit from a clearer explanation of the current state of the art,
including Apache Sedona — especially for community members who may not
be familiar with the existing geospatial ecosystem. Providing this
context can help prevent unnecessary reinvention and reduce the risk
of introducing incompatibilities with Sedona. For context, Sedona sees
over 2 million downloads each month, and millions of Sedona-Spark
sessions are created by users monthly across various platforms.

As sister projects under the Apache umbrella, Spark and Sedona should
aim to support and complement each other. The Sedona community is more
than willing to make adjustments on our side to ensure compatibility
and minimize disruption for users. We’re also happy to contribute
relevant portions of our code back to Spark where it makes sense.

For reference, here are a few relevant components that already exist in Sedona:
1. Geometry, geography, and raster UDTs
https://github.com/apache/sedona/tree/master/spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/UDT
2. Geospatial serializer
https://github.com/apache/sedona/tree/master/common/src/main/java/org/apache/sedona/common/geometrySerde
3. Geometry and Raster function catalog
https://github.com/apache/sedona/blob/master/spark/common/src/main/scala/org/apache/sedona/sql/UDF/Catalog.scala
4. Shared function implementations (used by Spark, Flink, Snowflake)
https://github.com/apache/sedona/tree/master/common/src/main/java/org/apache/sedona/common
5. Catalyst expressions for Sedona Spark
https://github.com/apache/sedona/tree/master/spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions
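To make item 2 concrete for readers unfamiliar with geometry serialization, here is a minimal Python sketch of a standard WKB (Well-Known Binary) round trip for a 2D point. Sedona's geometrySerde uses its own compact binary format, so this is purely illustrative of the geometry-to-bytes idea:

```python
# Minimal WKB round trip for a 2D point (illustrative only; Sedona's
# serializer uses its own compact format, but the idea is the same:
# geometry <-> bytes with a byte-order flag and a type tag).
import struct

WKB_POINT = 1  # geometry type code for Point in the WKB spec

def point_to_wkb(x, y):
    # byte-order flag (1 = little-endian), uint32 geometry type, two doubles
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)

def wkb_to_point(buf):
    byte_order, geom_type, x, y = struct.unpack("<BIdd", buf)
    assert byte_order == 1 and geom_type == WKB_POINT
    return (x, y)

wkb = point_to_wkb(13.4, 52.5)
print(len(wkb))           # -> 21 bytes: 1 + 4 + 8 + 8
print(wkb_to_point(wkb))  # -> (13.4, 52.5)
```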

Happy to chat further and hear your thoughts. Again, Sedona is an
Apache project — Spark is welcome to depend on Sedona and re-use any
of our work if helpful.

Thanks,
Jia

On Sat, Mar 29, 2025 at 2:41 PM Reynold Xin <r...@databricks.com> wrote:
>
> While I don’t think Spark should become a super specialized geospatial 
> processing engine, I don’t think it makes sense to focus *only* on reading 
> and writing from storage. Geospatial is a pretty common and fundamental 
> capability of analytics systems, and virtually every mature and popular 
> analytics system, be it open source or proprietary, storage or query, has 
> some basic geospatial data type and support. Adding a geospatial type and 
> some basic expressions is a no-brainer.
>
> On Sat, Mar 29, 2025 at 2:27 PM Jia Yu <ji...@apache.org> wrote:
>>
>> Hi Wenchen, Menelaos and Szehon,
>>
>> Thanks for the clarification — I’m glad to hear the primary motivation of 
>> this SPIP is focused on reading and writing geospatial data with Parquet and 
>> Iceberg. That’s an important goal, and I want to highlight that this problem 
>> is being solved by the Apache Sedona community.
>>
>> Since the primary motivation here is Parquet-level support, I suggest 
>> shifting the focus of this discussion toward enabling geo support in Spark 
>> Parquet DataSource rather than introducing core types.
>>
>> ** Why Spark Should Avoid Hardcoding Domain-Specific Types Like Geo Types **
>>
>> 1. Domain types evolve quickly.
>>
>> In geospatial, we already have geometry, geography, raster, trajectory, 
>> point clouds — and the list keeps growing. In AI/ML, we’re seeing tensors, 
>> vectors, and multi-dimensional arrays. Spark’s strength has always been in 
>> its general-purpose architecture and extensibility. Introducing hardcoded 
>> support for fast-changing domain-specific types risks long-term maintenance 
>> issues and eventual incompatibility with emerging standards.
>>
>> 2. Geospatial in Java and Python is dependency hell.
>>
>> There are multiple competing geometry libraries with incompatible APIs. No 
>> widely adopted Java library supports geography types. The most authoritative 
>> CRS dataset (EPSG) is not license-compatible with Apache. The JSON format 
>> for CRS definitions (PROJJSON) is fully supported only in PROJ, a C++ 
>> library with no Java equivalent and no formal OGC standard status. On the 
>> Python side, this would pull in Shapely and GeoPandas dependencies.
>>
>> 3. Sedona already fully supports Geo in (Geo)Parquet.
>>
>> Sedona has supported reading, writing, metadata preservation, and data 
>> skipping for GeoParquet (the predecessor of the Parquet Geo types) for over 
>> two years [2][3]. These features are production-tested and widely used.
>>
>> ** Proposed Path Forward: Geo Support via Spark Extensions **
>>
>> To enable seamless Parquet integration without burdening Spark core, here 
>> are two options:
>>
>> Option 1:
>> Sedona offers a dedicated `parquet-geo` DataSource that handles type 
>> encoding, metadata, and data skipping. No changes to Spark are required. 
>> This is already underway and will be maintained by the Sedona community to 
>> keep up with the evolving Geo standards.
>>
>> Option 2:
>> Spark provides hooks to inject:
>> - custom logical types / user-defined types (UDTs)
>> - custom statistics and filter pushdowns
>> Sedona can then extend the built-in `parquet` DataSource to integrate geo 
>> type metadata, predicate pushdown, and serialization seamlessly.
>>
>> For Iceberg, we’ve already published a proof-of-concept connector [4] 
>> showing Sedona, Spark, and Iceberg working together without any Spark core 
>> changes [5].
>>
>> ** On the Bigger Picture **
>>
>> I also agree with your long-term vision. I believe Spark is on the path to 
>> becoming a foundational compute engine — much like Postgres or Pandas — 
>> where the core remains focused and stable, while powerful domain-specific 
>> capabilities emerge from its ecosystem.
>>
>> To support this future, Spark could prioritize flexible extension hooks so 
>> that third-party libraries can thrive — just like we’ve seen with PostGIS, 
>> pgvector, TimescaleDB in the Postgres ecosystem, and GeoPandas in the Pandas 
>> ecosystem.
>>
>> Sedona is following this model by building geospatial support around Spark — 
>> not inside it — and we’d love to continue collaborating in this spirit.
>>
>> Happy to work together on providing Geo support in Parquet!
>>
>> Best,
>> Jia
>>
>> References
>>
>> [1] GeoParquet project:
>> https://github.com/opengeospatial/geoparquet
>>
>> [2] Sedona’s GeoParquet DataSource implementation:
>> https://github.com/apache/sedona/tree/master/spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet
>>
>> [3] Sedona’s GeoParquet documentation:
>> https://sedona.apache.org/latest/tutorial/files/geoparquet-sedona-spark/
>>
>> [4] Sedona-Iceberg connector (PoC):
>> https://github.com/wherobots/sedona-iceberg-connector
>>
>> [5] Spark-Sedona-Iceberg working example:
>> https://github.com/wherobots/sedona-iceberg-connector/blob/main/src/test/scala/com/wherobots/sedona/TestGeospatial.scala#L53
>>
>>
>> On 2025/03/29 19:27:08 Menelaos Karavelas wrote:
>> > To continue along the line of thought of Szehon:
>> >
>> > I am really excited that the Parquet and Iceberg communities have adopted 
>> > geospatial logical types and of course I am grateful for the work put in 
>> > that direction.
>> >
>> > As both Wenchen and Szehon pointed out in their own way, the goal is to 
>> > have minimal support in Spark, as a common platform, for these types.
>> >
>> > To be more specific and explicit: The proposal scope is to add support for 
>> > reading/writing to Parquet, based on the new standard, as well as adding 
>> > the types as built-in types in Spark to complement the storage support. 
>> > The few ST expressions that are in the proposal are what seem to be the 
>> > minimal set of expressions needed to support working with geospatial 
>> > values in the Spark engine in a meaningful way.
>> >
>> > Best,
>> >
>> > Menelaos
>> >
>> >
>> > > On Mar 29, 2025, at 12:06 PM, Szehon Ho <szehon.apa...@gmail.com> wrote:
>> > >
>> > > Thank you Menelaos, will do!
>> > >
>> > > To give a little background, Jia and the Sedona community, as well as 
>> > > the GeoParquet community and others, put a great deal of effort into 
>> > > defining the Parquet and Iceberg geo types, which couldn't have been 
>> > > done without their experience and help!
>> > >
>> > > But I do agree with Wenchen: now that the types are in most common data 
>> > > sources in the ecosystem, I think Apache Spark as a common platform needs 
>> > > to have this type definition for interop; otherwise users of vanilla 
>> > > Spark cannot work with the geospatial data stored in those data sources. 
>> > > (IMO a similar rationale applies to adding timestamp nano in the other 
>> > > ongoing SPIP.)
>> > >
>> > > And like Wenchen said, the SPIP’s goal doesn't seem to be to fragment the 
>> > > ecosystem by implementing Sedona’s advanced geospatial analytics in Spark 
>> > > itself, which you may be right belongs in pluggable frameworks. Menelaos 
>> > > may explain more about the SPIP goal.
>> > >
>> > > I do hope there can be more collaboration across communities (as in the 
>> > > Iceberg/Parquet collaboration), drawing on the Sedona community’s 
>> > > experience to make sure these type definitions are optimal and compatible 
>> > > with Sedona.
>> > >
>> > > Thanks!
>> > > Szehon
>> > >
>> > >
>> > >> On Mar 29, 2025, at 8:04 AM, Menelaos Karavelas 
>> > >> <menelaos.karave...@gmail.com> wrote:
>> > >>
>> > >> 
>> > >> Hello Szehon,
>> > >>
>> > >> I just created a Google doc and also linked it in the JIRA:
>> > >>
>> > >> https://docs.google.com/document/d/1cYSNPGh95OjnpS0k_KDHGM9Ae3j-_0Wnc_eGBZL4D3w/edit?tab=t.0
>> > >>
>> > >> Please feel free to comment on it.
>> > >>
>> > >> Best,
>> > >>
>> > >> Menelaos
>> > >>
>> > >>
>> > >>> On Mar 28, 2025, at 2:19 PM, Szehon Ho <szehon.apa...@gmail.com> wrote:
>> > >>>
>> > >>> Thanks Menelaos, this is exciting! Is there a Google Doc we can 
>> > >>> comment on, or just the JIRA?
>> > >>>
>> > >>> Thanks
>> > >>> Szehon
>> > >>>
>> > >>> On Fri, Mar 28, 2025 at 1:41 PM Ángel Álvarez Pascua 
>> > >>> <angel.alvarez.pas...@gmail.com 
>> > >>> <mailto:angel.alvarez.pas...@gmail.com>> wrote:
>> > >>>> Sorry, I only had a quick look at the proposal, looked for WKT and 
>> > >>>> didn't find anything.
>> > >>>>
>> > >>>> It's been years since I worked on geospatial projects and I'm not an 
>> > >>>> expert (at all). Maybe start with something simple but useful, like 
>> > >>>> WKT<=>WKB conversion?
>> > >>>>
>> > >>>>
>> > >>>> On Fri, Mar 28, 2025, 21:27, Menelaos Karavelas 
>> > >>>> <menelaos.karave...@gmail.com <mailto:menelaos.karave...@gmail.com>> 
>> > >>>> wrote:
>> > >>>>> In the SPIP Jira the proposal is to add the expressions ST_AsBinary, 
>> > >>>>> ST_GeomFromWKB, and ST_GeogFromWKB.
>> > >>>>> Is there anything else that you think should be added?
>> > >>>>>
>> > >>>>> Regarding WKT, what do you think should be added?
>> > >>>>>
>> > >>>>> - Menelaos
>> > >>>>>
>> > >>>>>
>> > >>>>>> On Mar 28, 2025, at 1:02 PM, Ángel Álvarez Pascua 
>> > >>>>>> <angel.alvarez.pas...@gmail.com 
>> > >>>>>> <mailto:angel.alvarez.pas...@gmail.com>> wrote:
>> > >>>>>>
>> > >>>>>> What about adding support for WKT 
>> > >>>>>> <https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry>/WKB
>> > >>>>>>  
>> > >>>>>> <https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary>?
>> > >>>>>>
>> > >>>>>> On Fri, Mar 28, 2025 at 20:50, Ángel Álvarez Pascua 
>> > >>>>>> (<angel.alvarez.pas...@gmail.com 
>> > >>>>>> <mailto:angel.alvarez.pas...@gmail.com>>) wrote:
>> > >>>>>>> +1 (non-binding)
>> > >>>>>>>
>> > >>>>>>> On Fri, Mar 28, 2025, 18:48, Menelaos Karavelas 
>> > >>>>>>> <menelaos.karave...@gmail.com 
>> > >>>>>>> <mailto:menelaos.karave...@gmail.com>> wrote:
>> > >>>>>>>> Dear Spark community,
>> > >>>>>>>>
>> > >>>>>>>> I would like to propose the addition of new geospatial data types 
>> > >>>>>>>> (GEOMETRY and GEOGRAPHY), which represent geospatial values and 
>> > >>>>>>>> were recently added as new logical types in the Parquet 
>> > >>>>>>>> specification.
>> > >>>>>>>>
>> > >>>>>>>> The new types should improve Spark’s ability to read the new 
>> > >>>>>>>> Parquet logical types and perform some minimal meaningful 
>> > >>>>>>>> operations on them.
>> > >>>>>>>>
>> > >>>>>>>> SPIP: https://issues.apache.org/jira/browse/SPARK-51658
>> > >>>>>>>>
>> > >>>>>>>> Looking forward to your comments and feedback.
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> Best regards,
>> > >>>>>>>>
>> > >>>>>>>> Menelaos Karavelas
>> > >>>>>>>>
>> > >>>>>
>> > >>
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
