Hi Jia,

This is a good question. As the shepherd of this SPIP, I'd like to clarify the
motivation: the focus of this project is the storage side, not processing.
Apache Sedona is a great library for geo processing, but without native geo
type support in Spark, users cannot:

- read geo type columns from Parquet files (or other data sources) directly
- write geo values into Parquet files (or other data sources) as native geo types
- push down geo predicates to the data source when reading
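To make the gap concrete, here is a minimal sketch of today's workaround,
assuming WKB bytes decoded with JTS (the file path and column name below are
made up for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf
    import org.locationtech.jts.io.{WKBReader, WKBWriter}

    val spark = SparkSession.builder().appName("geo-gap-sketch").getOrCreate()
    import spark.implicits._

    // Without a native geo type, a geometry column surfaces (at best) as
    // opaque WKB bytes in a BinaryType column.
    val df = spark.read.parquet("/data/places.parquet")

    // Spatial logic hides inside a black-box UDF; the data source only
    // sees a binary column, so nothing can be pushed down or pruned.
    val nonEmpty = udf((wkb: Array[Byte]) => !new WKBReader().read(wkb).isEmpty)
    df.filter(nonEmpty($"geom")).show()

    // Writing is the mirror image: serialize to WKB by hand before saving,
    // e.g. new WKBWriter().write(geom) for a JTS Geometry `geom`.

With a native GEOMETRY type, the column would carry a real geo schema and a
spatial filter could reach the Parquet reader instead of being evaluated row
by row inside an opaque UDF. The WKB-plus-JTS pairing in the sketch is one
plausible answer to the protocol question raised below, not a settled choice.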
In the SPIP JIRA, we explicitly noted that "This proposal is laying the
foundation - building the infrastructure to handle geospatial data, but not
creating a full-featured geospatial processing system. Such extension can be
done later as a separate improvement." Maybe the right direction is to never
build that processing layer in Spark at all and to leave it to third-party
libraries. The ultimate goal is to establish Spark as a comprehensive platform
that connects to a rich ecosystem of third-party data sources and processing
libraries.

For this project, we should definitely work closely with the Apache Sedona
community to figure out the best protocol (which binary/text format to use,
how to represent geo values in Java, etc.).

Thanks,
Wenchen

On Sat, Mar 29, 2025 at 5:28 AM Jia Yu <ji...@apache.org> wrote:

> Dear Menelaos,
>
> Thanks for bringing this up again. I’ve seen similar proposals come up on
> the mailing list before, and I’d like to offer some thoughts.
>
> For full transparency, I’m Jia Yu, PMC Chair of Apache Sedona
> (https://github.com/apache/sedona), a widely used open-source cluster
> computing framework for processing large-scale geospatial data on Spark,
> Flink, and other engines.
>
> From what I understand, this proposal aims to add native geospatial types
> and functionality directly into Spark. However, this seems to replicate
> much of the work the Sedona project has done over the past 10 years.
>
> Sedona has a strong and active community with well-established
> contribution guidelines. It is already used extensively with Spark in
> production, on platforms like Databricks, AWS EMR, Microsoft Fabric, and
> Google Cloud. Users simply add the Sedona jar, flip a Spark config, and it
> just works, much like other mature Spark ecosystem libraries.
>
> The project sees over 2 million downloads per month across PyPI, Maven,
> etc., and has been downloaded more than 45 million times overall. Thousands
> of organizations rely on Sedona in production Spark environments.
>
> Sedona has also contributed actively to upstream ecosystem efforts, such
> as geospatial support in the Parquet and Iceberg formats.
>
> Additionally, Sedona’s core technology has been peer-reviewed and
> published at top academic conferences, and its performance has been
> evaluated and benchmarked by many independent research articles:
> https://sedona.apache.org/latest/community/publication/
>
> Given all of this, I’m genuinely unsure what gap a new Spark-native effort
> is aiming to fill. If there is a specific limitation that Sedona cannot
> address, I’d be eager to understand it. Otherwise, duplicating this
> functionality risks fragmenting the ecosystem and confusing current users.
> I would strongly advocate for close coordination with the Sedona community
> to avoid disruption and to ensure alignment with the broader Spark
> ecosystem.
>
> Thanks again for raising this; we’re always happy to collaborate and
> strengthen the ecosystem together.
>
> Here is a quick overview of what Apache Sedona already offers on Spark:
>
> • Geospatial type support:
>   • Vector: Geometry, partial Geography
>   • Raster
> • Vector data sources:
>   • GeoParquet (read/write), GeoJSON (read/write), Shapefile,
>     GeoPackage, OpenStreetMap PBF
> • Raster data sources:
>   • STAC catalog, GeoTiff (read/write), NetCDF/HDF
> • Functions:
>   • 209+ vector (ST_*) functions
>   • 100+ raster (RS_*) functions
>   • GeoStats SQL: DBSCAN, hotspot analysis, outlier detection
> • Language support:
>   • Scala, Java, SQL, Python, R
> • Query acceleration via R-Tree:
>   • Distributed and broadcast spatial joins
>   • KNN joins
>   • Range queries
> • UDF support:
>   • Scala UDFs (JTS), Python UDFs (Shapely, Rasterio, NumPy),
>     Pandas UDFs
> • Serialization:
>   • Custom serializers for geometry types
> • Ecosystem integrations:
>   • Jupyter, Zeppelin, Apache Arrow, GeoPandas read/write,
>     GeoPandas-like API
>
> Jia Yu
>
> On 2025/03/28 17:46:15 Menelaos Karavelas wrote:
> > Dear Spark community,
> >
> > I would like to propose the addition of new geospatial data types
> > (GEOMETRY and GEOGRAPHY), which represent geospatial values and match
> > the logical types recently added to the Parquet specification.
> >
> > The new types should improve Spark’s ability to read the new Parquet
> > logical types and perform some minimal meaningful operations on them.
> >
> > SPIP: https://issues.apache.org/jira/browse/SPARK-51658
> >
> > Looking forward to your comments and feedback.
> >
> > Best regards,
> >
> > Menelaos Karavelas