andygrove opened a new issue, #4455:
URL: https://github.com/apache/datafusion-comet/issues/4455

   ## Background
   
   PR #4423 proposes adding 40 native geospatial SQL functions (`ST_Contains`, 
`ST_Intersects`, `ST_Distance`, etc.) directly into Comet, executed via 
DataFusion using new Rust dependencies (`geo`, `geoarrow`, `geojson`, `geos`, 
`wkt`). The functions are wired through `CometSparkSessionExtensions` so that 
users get them automatically once Comet is enabled — no Sedona dependency 
required.
   
   This represents a potential shift in scope for Comet, which to date has 
focused on accelerating Spark's built-in expressions and operators. The 
discussion on the PR raised several open questions that deserve broader 
community input, ideally on the dev mailing list as well as here.
   
   ## Questions for discussion
   
   1. **Should geospatial support be in scope for Comet at all?** Comet has 
historically targeted Spark built-ins (math, string, datetime, aggregates, 
etc.). `ST_*` functions are not Spark built-ins — they come from Apache Sedona 
or other extensions. Adding them would expand Comet's surface area into a 
domain that already has dedicated projects.
   
   2. **If yes, what is the right implementation path?**
      - **In-tree, maintained by Comet** — as proposed in #4423. Comet owns the 
function definitions, tests, and dependencies (including the GEOS C library via 
the `geos` crate with static linking).
      - **Wrap SedonaDB** — @paleolimbot noted that SedonaDB has ~100 functions 
plus join/Parquet IO already implemented, tested, and benchmarked in Rust. 
Comet could wrap those, limiting Comet's maintenance burden to thin wrapper 
code.
      - **Defer to Sedona / Wherobots** — users who need geo today already have 
options (Sedona on Spark; Wherobots offers a Rust-accelerated path). Comet 
could choose not to enter this space.
   
   3. **Geometry representation.** The PR uses WKT strings. @paleolimbot 
pointed out that Spark, Parquet, and SedonaDB all use WKB, which is 
significantly faster ("the equivalent of passing around doubles as strings"). 
If Comet adopts geo, what is the right representation, and does that depend on 
broader UDT support?
   
   4. **UDT / Spark geometry type support.** @paleolimbot mentioned that full 
UDT support would require changing many `DataType` usages to `FieldRef` usages. 
Spark geometry has a type parameter that is dropped when represented as Utf8. 
Is this a prerequisite for doing geo "properly," and is it work the project 
wants to take on?
   
   5. **Build and runtime dependencies.** The proposed approach adds a native 
dependency on GEOS (statically linked, so end users don't need it at runtime, 
but build machines do). How does the community feel about adding a C library 
dependency to the Comet build?
   
   6. **Maintenance burden.** Geo functions are a large surface area (the PR 
adds 40; SedonaDB has ~100). Who maintains them, who reviews changes, and who 
handles compatibility as Sedona/Spark evolve?
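
To make the WKT versus WKB point in question 3 concrete, here is a minimal sketch that decodes the same point from both representations using only the Python standard library. It is not code from PR #4423 or SedonaDB; the helper names (`point_to_wkb`, `point_from_wkb`, `point_from_wkt`) are hypothetical, and the WKT parser handles only the `POINT (x y)` form. The WKB layout used is the standard one for a 2D point: a 1-byte byte-order flag, a 4-byte geometry type, then two 8-byte doubles.

```python
import struct
from typing import Tuple

WKB_POINT = 1  # geometry type code for Point in the WKB encoding


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (21 bytes)."""
    # "<" = little-endian, no padding; B = byte-order flag, I = geometry type,
    # d d = the x and y coordinates as IEEE 754 doubles.
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)


def point_from_wkb(buf: bytes) -> Tuple[float, float]:
    """Decode a 2D WKB point: fixed offsets, no text parsing."""
    endian = "<" if buf[0] == 1 else ">"
    geom_type, x, y = struct.unpack_from(endian + "Idd", buf, 1)
    if geom_type != WKB_POINT:
        raise ValueError(f"expected Point, got geometry type {geom_type}")
    return x, y


def point_from_wkt(text: str) -> Tuple[float, float]:
    """Decode 'POINT (x y)': tokenizing plus string-to-float conversion."""
    body = text.strip()
    if not body.upper().startswith("POINT"):
        raise ValueError("only POINT is supported in this sketch")
    inner = body[body.index("(") + 1 : body.rindex(")")]
    x_str, y_str = inner.split()
    return float(x_str), float(y_str)


if __name__ == "__main__":
    wkb = point_to_wkb(1.5, -2.25)
    print(len(wkb), point_from_wkb(wkb), point_from_wkt("POINT (1.5 -2.25)"))
```

The WKB path reads doubles at fixed offsets, while the WKT path has to scan, split, and convert decimal text for every coordinate, which is the overhead the "doubles as strings" comparison refers to.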
   
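The concern in question 4 can be illustrated with a toy model, assuming the geometry type's parameter travels as field-level metadata, which is how Arrow extension types are conveyed. The `Field` class and both functions below are hypothetical and are not the Arrow, DataFusion, or Comet API. The metadata keys follow Arrow's extension-type convention, and the CRS value is only an illustrative placeholder.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Field:
    """Toy stand-in for a schema field: storage type plus metadata."""
    name: str
    data_type: str  # storage type only, e.g. "Utf8" or "Binary"
    metadata: Dict[str, str] = field(default_factory=dict)


def describe_by_type(data_type: str) -> str:
    # A function that receives only the storage type cannot see any
    # extension name or type parameter attached to the field.
    return data_type


def describe_by_field(f: Field) -> str:
    # A function that receives the whole field can recover both.
    ext_name = f.metadata.get("ARROW:extension:name")
    if ext_name is None:
        return f.data_type
    params = f.metadata.get("ARROW:extension:metadata", "")
    return f"{ext_name}({params}) stored as {f.data_type}"


geom = Field(
    name="geom",
    data_type="Binary",
    metadata={
        "ARROW:extension:name": "geoarrow.wkb",
        "ARROW:extension:metadata": '{"crs": "EPSG:4326"}',  # illustrative
    },
)

if __name__ == "__main__":
    print(describe_by_type(geom.data_type))
    print(describe_by_field(geom))
```

Any code path shaped like `describe_by_type` silently treats a geometry column as plain bytes or text, which is why moving from `DataType` to `FieldRef` usages touches so many call sites.
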
   ## References
   
   - PR #4423 — https://github.com/apache/datafusion-comet/pull/4423
   - SedonaDB — https://github.com/apache/sedona
   - Wherobots — https://wherobots.com/
   
   ## Next steps
   
   @andygrove suggested taking this to the dev@ mailing list given the scope 
shift implications. This issue is intended to collect written input from 
contributors and users before/alongside that discussion. Please weigh in with 
your perspective, especially if you have a use case for geo in Comet or 
experience maintaining geospatial libraries.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

