andygrove opened a new issue, #4455:
URL: https://github.com/apache/datafusion-comet/issues/4455
## Background
PR #4423 proposes adding 40 native geospatial SQL functions (`ST_Contains`,
`ST_Intersects`, `ST_Distance`, etc.) directly into Comet, executed via
DataFusion using new Rust dependencies (`geo`, `geoarrow`, `geojson`, `geos`,
`wkt`). The functions are wired through `CometSparkSessionExtensions` so that
users get them automatically once Comet is enabled — no Sedona dependency
required.
This represents a potential shift in scope for Comet, which to date has
focused on accelerating Spark's built-in expressions and operators. The
discussion on the PR raised several open questions that deserve broader
community input, ideally on the dev mailing list as well as here.
## Questions for discussion
1. **Should geospatial support be in scope for Comet at all?** Comet has
historically targeted Spark built-ins (math, string, datetime, aggregates,
etc.). `ST_*` functions are not Spark built-ins — they come from Apache Sedona
or other extensions. Adding them would expand Comet's surface area into a
domain that already has dedicated projects.
2. **If yes, what is the right implementation path?**
- **In-tree, maintained by Comet** — as proposed in #4423. Comet owns the
function definitions, tests, and dependencies (including the GEOS C library via
the `geos` crate with static linking).
- **Wrap SedonaDB** — @paleolimbot noted that SedonaDB has ~100 functions
plus join/Parquet IO already implemented, tested, and benchmarked in Rust.
Comet could wrap those, limiting Comet's maintenance burden to thin wrapper
code.
- **Defer to Sedona / Wherobots** — users who need geo today already have
options (Sedona on Spark; Wherobots offers a Rust-accelerated path). Comet
could choose not to enter this space.
3. **Geometry representation.** The PR uses WKT strings. @paleolimbot
pointed out that Spark, Parquet, and SedonaDB all use WKB, which is
significantly faster ("the equivalent of passing around doubles as strings").
If Comet adopts geo, what is the right representation, and does that depend on
broader UDT support?
4. **UDT / Spark geometry type support.** @paleolimbot mentioned that full
UDT support would require changing many `DataType` usages to `FieldRef` usages.
Spark geometry has a type parameter that is dropped when represented as Utf8.
Is this a prerequisite for doing geo "properly," and is it work the project
wants to take on?
5. **Build and runtime dependencies.** The proposed approach adds a native
dependency on GEOS (statically linked, so end users don't need it at runtime,
but build machines do). How does the community feel about adding a C library
dependency to the Comet build?
6. **Maintenance burden.** Geo functions are a large surface area (the PR
adds 40; SedonaDB has ~100). Who maintains them, who reviews changes, and who
handles compatibility as Sedona/Spark evolve?
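
To make the WKT vs. WKB point in question 3 concrete, here is a minimal Python sketch using only the standard library. It is not code from the PR, Comet, or SedonaDB; the function names are hypothetical and it handles only little-endian 2D points. It shows that a WKB point carries its coordinates as raw 8-byte doubles in a fixed layout, while a WKT point has to be tokenized and each coordinate parsed from text.

```python
import struct
from typing import Tuple

# Little-endian WKB layout for a 2D point (21 bytes, no padding):
# 1 byte byte-order flag, 4 byte geometry type, then x and y as doubles.
WKB_POINT_FORMAT = "<BIdd"
WKB_LITTLE_ENDIAN = 1
WKB_TYPE_POINT = 1


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (hypothetical helper)."""
    return struct.pack(WKB_POINT_FORMAT, WKB_LITTLE_ENDIAN, WKB_TYPE_POINT, x, y)


def point_from_wkb(buf: bytes) -> Tuple[float, float]:
    """Decode a little-endian WKB 2D point: a fixed-offset binary read."""
    byte_order, geom_type, x, y = struct.unpack(WKB_POINT_FORMAT, buf)
    if byte_order != WKB_LITTLE_ENDIAN or geom_type != WKB_TYPE_POINT:
        raise ValueError("this sketch only handles little-endian 2D points")
    return x, y


def point_from_wkt(text: str) -> Tuple[float, float]:
    """Decode 'POINT (x y)': find the parentheses, split, parse two floats."""
    body = text.strip()
    if not body.upper().startswith("POINT"):
        raise ValueError("this sketch only handles POINT geometries")
    inner = body[body.index("(") + 1 : body.rindex(")")]
    x_str, y_str = inner.split()
    return float(x_str), float(y_str)


wkb = point_to_wkb(1.5, -2.25)
wkt = "POINT (1.5 -2.25)"
```

Both decoders return the same coordinates, but the WKT path does string scanning and float parsing for every value on every function call. That per-row work is what the "doubles as strings" comparison refers to.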
## References
- PR #4423 — https://github.com/apache/datafusion-comet/pull/4423
- SedonaDB — https://github.com/apache/sedona
- Wherobots — https://wherobots.com/
## Next steps
@andygrove suggested taking this to the dev@ mailing list given the
implications of the scope shift. This issue is intended to collect written input from
contributors and users before or alongside that discussion. Please weigh in with
your perspective, especially if you have a use case for geo in Comet or
experience maintaining geospatial libraries.