Hi Jia,

This is a good question. As the shepherd of this SPIP, I'd like to clarify the
motivation: the focus of this project is the storage side, not processing.
Apache Sedona is a great library for geo processing, but without native geo
type support in Spark, users cannot:

- read geo type columns from Parquet files (or other data sources) directly
- write geo values into Parquet files (or other data sources) as native geo types
- push down geo predicates to the data source when reading
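To make the gap concrete, here is a minimal sketch of today's workaround,
assuming WKB bytes decoded with JTS (the file path and column name below are
made up for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf
    import org.locationtech.jts.io.{WKBReader, WKBWriter}

    val spark = SparkSession.builder().appName("geo-gap-sketch").getOrCreate()
    import spark.implicits._

    // Without a native geo type, a geometry column surfaces (at best) as
    // opaque WKB bytes in a BinaryType column.
    val df = spark.read.parquet("/data/places.parquet")

    // Spatial logic hides inside a black-box UDF; the data source only
    // sees a binary column, so nothing can be pushed down or pruned.
    val nonEmpty = udf((wkb: Array[Byte]) => !new WKBReader().read(wkb).isEmpty)
    df.filter(nonEmpty($"geom")).show()

    // Writing is the mirror image: serialize to WKB by hand before saving,
    // e.g. new WKBWriter().write(geom) for a JTS Geometry `geom`.

With a native GEOMETRY type, the column would carry a real geo schema and a
spatial filter could reach the Parquet reader instead of being evaluated row
by row inside an opaque UDF. The WKB-plus-JTS pairing in the sketch is one
plausible answer to the protocol question raised below, not a settled choice.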
In the SPIP JIRA, we explicitly noted that "This proposal is laying the
foundation - building the infrastructure to handle geospatial data, but not
creating a full-featured geospatial processing system. Such extension can be
done later as a separate improvement." Maybe the right direction is to never
build that processing layer in Spark at all and to leave it to third-party
libraries. The ultimate goal is to establish Spark as a comprehensive platform
that connects to a rich ecosystem of third-party data sources and processing
libraries.

For this project, we should definitely work closely with the Apache Sedona
community to figure out the best protocol (which binary/text format to use,
how to represent geo values in Java, etc.).

Thanks,
Wenchen

On Sat, Mar 29, 2025 at 5:28 AM Jia Yu <ji...@apache.org> wrote:

> Dear Menelaos,
>
> Thanks for bringing this up again. I’ve seen similar proposals come up on
> the mailing list before, and I’d like to offer some thoughts.
>
> For full transparency, I’m Jia Yu, PMC Chair of Apache Sedona
> (https://github.com/apache/sedona), a widely used open-source cluster
> computing framework for processing large-scale geospatial data on Spark,
> Flink, and other engines.
>
> From what I understand, this proposal aims to add native geospatial types
> and functionality directly into Spark. However, this seems to replicate
> much of the work the Sedona project has done over the past 10 years.
>
> Sedona has a strong and active community with well-established
> contribution guidelines. It is already used extensively with Spark in
> production, on platforms like Databricks, AWS EMR, Microsoft Fabric, and
> Google Cloud. Users simply add the Sedona jar, flip a Spark config, and it
> just works, much like other mature Spark ecosystem libraries.
>
> The project sees over 2 million downloads per month across PyPI, Maven,
> etc., and has been downloaded more than 45 million times overall. Thousands
> of organizations rely on Sedona in production Spark environments.
>
> Sedona has also contributed actively to upstream ecosystem efforts, such
> as geospatial support in the Parquet and Iceberg formats.
>
> Additionally, Sedona’s core technology has been peer-reviewed and
> published at top academic conferences, and its performance has been
> evaluated and benchmarked by many independent research articles:
> https://sedona.apache.org/latest/community/publication/
>
> Given all of this, I’m genuinely unsure what gap a new Spark-native effort
> is aiming to fill. If there is a specific limitation that Sedona cannot
> address, I’d be eager to understand it. Otherwise, duplicating this
> functionality risks fragmenting the ecosystem and confusing current users.
> I would strongly advocate for close coordination with the Sedona community
> to avoid disruption and to ensure alignment with the broader Spark
> ecosystem.
>
> Thanks again for raising this; we’re always happy to collaborate and
> strengthen the ecosystem together.
>
> Here is a quick overview of what Apache Sedona already offers on Spark:
>
> • Geospatial type support:
>   • Vector: Geometry, partial Geography
>   • Raster
> • Vector data sources:
>   • GeoParquet (read/write), GeoJSON (read/write), Shapefile,
>     GeoPackage, OpenStreetMap PBF
> • Raster data sources:
>   • STAC catalog, GeoTiff (read/write), NetCDF/HDF
> • Functions:
>   • 209+ vector (ST_*) functions
>   • 100+ raster (RS_*) functions
>   • GeoStats SQL: DBSCAN, hotspot analysis, outlier detection
> • Language support:
>   • Scala, Java, SQL, Python, R
> • Query acceleration via R-Tree:
>   • Distributed and broadcast spatial joins
>   • KNN joins
>   • Range queries
> • UDF support:
>   • Scala UDFs (JTS), Python UDFs (Shapely, Rasterio, NumPy),
>     Pandas UDFs
> • Serialization:
>   • Custom serializers for geometry types
> • Ecosystem integrations:
>   • Jupyter, Zeppelin, Apache Arrow, GeoPandas read/write,
>     GeoPandas-like API
>
> Jia Yu
>
> On 2025/03/28 17:46:15 Menelaos Karavelas wrote:
> > Dear Spark community,
> >
> > I would like to propose the addition of new geospatial data types
> > (GEOMETRY and GEOGRAPHY), which represent geospatial values and match
> > the logical types recently added to the Parquet specification.
> >
> > The new types should improve Spark’s ability to read the new Parquet
> > logical types and perform some minimal meaningful operations on them.
> >
> > SPIP: https://issues.apache.org/jira/browse/SPARK-51658
> >
> > Looking forward to your comments and feedback.
> >
> > Best regards,
> >
> > Menelaos Karavelas