paleolimbot commented on issue #264:
URL: https://github.com/apache/sedona-db/issues/264#issuecomment-3471007912
Thanks for opening!
There are a few things going on here:
- Data sources ("tables") are DataFusion `TableProvider`s. TableProviders
receive a "projection" (requested columns) and a filter expression.
- A common special case of the table provider is reading one or more "file"s
(specifically, objects on an object store). These are implemented using the
`FileFormat` API (for which a `TableProvider` can be constructed using a
`ListingTable`). This is how we implement (Geo)Parquet (by wrapping the
`ParquetFileFormat`), and it's what makes `SELECT * FROM 'foofy.parquet'` work
in our SQL. I think technically it also powers `COPY TO/FROM` but I never
actually remember the syntax for that long enough to use it.
- `st_read()` (or `read_parquet()`) are user-defined table functions. Table
functions are just functions that accept scalar values and return a
`TableProvider`. A slight hiccup is that they aren't `async` and need a fully
resolved schema, so we have to have an `Arc<Runtime>` + `block_on` for most
realistic applications, including constructing and returning a `ListingTable`.
- We focused on providing `read_xxx()` functions in Python/R before SQL
because they're easier to use and make it easier for a user to reach the
documentation while typing the code. Conceptually the arguments are the same.
- We want both SedonaDB and SedonaSpark to be great and are happy to merge
great ideas to either one! (They have to start somewhere!)
- I'm working on https://github.com/apache/sedona-db/pull/251 to make
wrapping `ArrowArrayStream`-based formats (like GDAL!) easier. Basically, if
you can get me an Arrow Schema and an Arrow record batch reader from a URI, you
get the multi-file reader for free. I think I can have that ready tomorrow.
- All this applies equally to raster (it's just a column data type).
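
To make the moving parts above a bit more concrete, here is a rough, stdlib-only Python sketch of the shape of the idea. It is not the SedonaDB or DataFusion API: `open_uri`, `resolve_schema`, `MultiFileTable`, `read_toy`, and the `mem://` URIs are all hypothetical names made up for illustration, tuples of column names stand in for an Arrow schema, and lists of dicts stand in for record batches. It only tries to show (1) a per-format hook that turns one URI into a schema plus a batch reader, (2) a synchronous table function that blocks on async schema resolution (roughly the role `block_on` plays), and (3) a scan that receives a projection and a filter.

```python
import asyncio

# Hypothetical in-memory "object store": URI -> (schema, batches).
# A schema is a tuple of column names; a batch is a list of row dicts.
FILES = {
    "mem://a": (("id", "name"), [[{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]]),
    "mem://b": (("id", "name"), [[{"id": 3, "name": "z"}]]),
}


def open_uri(uri):
    """Per-format hook (hypothetical): one URI in, schema + batch reader out."""
    schema, batches = FILES[uri]
    return schema, iter(batches)


async def resolve_schema(uris):
    """Pretend schema inference is async I/O, as it would be on an object store."""
    await asyncio.sleep(0)
    schemas = {open_uri(uri)[0] for uri in uris}
    if len(schemas) != 1:
        raise ValueError("files disagree on schema")
    return schemas.pop()


class MultiFileTable:
    """Toy stand-in for a table provider over several files."""

    def __init__(self, uris, schema):
        self.uris = list(uris)
        self.schema = schema

    def scan(self, projection=None, predicate=None):
        """Yield batches, keeping only requested columns and matching rows."""
        columns = list(projection) if projection is not None else list(self.schema)
        for uri in self.uris:
            _, reader = open_uri(uri)
            for batch in reader:
                rows = [r for r in batch if predicate is None or predicate(r)]
                yield [{c: r[c] for c in columns} for r in rows]


def read_toy(*uris):
    """Synchronous 'table function': scalar arguments in, table out.

    The schema must be known before returning, so this blocks on the
    async step (assumes no event loop is already running).
    """
    schema = asyncio.run(resolve_schema(uris))
    return MultiFileTable(uris, schema)


table = read_toy("mem://a", "mem://b")
rows = [
    row
    for batch in table.scan(projection=["name"], predicate=lambda r: r["id"] >= 2)
    for row in batch
]
```

The point of the sketch is that only `open_uri` knows anything about the file format; everything else (schema checking across files, projection, filtering, chaining readers) is written once and shared.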