james-willis opened a new pull request, #858:
URL: https://github.com/apache/sedona-db/pull/858
## Summary
Adds support for loading Zarr stores as rasters. Each chunk becomes one row,
similar to how each TIFF becomes one row.
```sql
SELECT raster FROM sd_read_zarr('file:///path/to/datacube.zarr');
SELECT count(*) FROM sd_read_zarr(
'file:///path/to/datacube.zarr',
'{"mode": "outdb", "rows_per_batch": 256}'
);
```
The result is returned as a single-column table whose column is named
`raster`. For now, chunks are loaded eagerly as part of the `sd_read_zarr`
operation. When lazy loading support lands in #849, I will add lazy Zarr support
and make that the default.
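As a rough illustration of the "each chunk becomes one row" model, the sketch below enumerates the chunks of a regular chunk grid for a single array. It is written in Python for brevity and is not the crate's implementation; the function name `chunk_grid` and the example shapes are hypothetical, and clipping edge chunks to the array bounds is an assumption about how partial chunks would be reported.

```python
import itertools
import math


def chunk_grid(array_shape, chunk_shape):
    """Yield (chunk_index, start, stop) for every chunk of a regular grid.

    Hypothetical helper: one yielded tuple corresponds to one output row.
    """
    if len(array_shape) != len(chunk_shape):
        raise ValueError("array and chunk shapes must have the same rank")
    if any(c < 1 for c in chunk_shape):
        raise ValueError("chunk sizes must be positive")
    # Number of chunks along each axis; the last chunk on an axis may be partial.
    counts = [math.ceil(n / c) for n, c in zip(array_shape, chunk_shape)]
    for index in itertools.product(*(range(k) for k in counts)):
        start = tuple(i * c for i, c in zip(index, chunk_shape))
        # Assumption: the valid region of an edge chunk is clipped to the array.
        stop = tuple(
            min(s + c, n) for s, c, n in zip(start, chunk_shape, array_shape)
        )
        yield index, start, stop


# Example: a (band, y, x) array of shape 3 x 100 x 250 in 1 x 64 x 128 chunks.
rows = list(chunk_grid((3, 100, 250), (1, 64, 128)))
print(len(rows))  # 3 * 2 * 2 = 12 chunks, so 12 rows under this model
```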
We built on the zarrs 0.23 crate rather than GDAL MultiDim because two GDAL
gaps are independently disqualifying: Zarr v3 sharding is unreadable in GDAL ≤
3.12 (sharded Zarr is a critical cloud-native feature IMO), and vlen-utf8
string coordinate variables are unreadable through GDAL MultiDim
(climate/datacube Zarr uses these for band names).
## What's in this PR
- New crate sedona-raster-zarr with `group_to_indb_rasters(uri)` and
`group_to_outdb_rasters(uri)` entry points.
- GeoZarr metadata parsing: proj:wkt2 / proj:projjson / proj:epsg (in that
precedence) for CRS; spatial:transform + spatial:dims for affine + spatial axes.
- Group-constraint validation at load time: shared chunk grid, chunk shape,
and dim names across the group; named errors on the offending array.
- `sd_read_zarr` UDTF registered in `SedonaContext::new_from_context`. JSON
options: `mode` (default "indb"), `rows_per_batch` (default 1024),
`num_partitions` (default 1).
- Empty/partial projection support via ProjectionExec wrap, so `SELECT
count(*) FROM sd_read_zarr(...)` works.
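The JSON options and their defaults listed above can be sketched as follows. This is an illustration in Python, not the UDTF's actual parsing code: the name `resolve_options`, the rejection of unknown keys, the assumption that `indb` and `outdb` are the only modes, and the positive-integer checks are all assumptions. Only the three option names and their default values come from this description.

```python
import json

# Defaults as stated in the PR description.
DEFAULTS = {"mode": "indb", "rows_per_batch": 1024, "num_partitions": 1}
# Assumption: these are the only two modes.
VALID_MODES = {"indb", "outdb"}


def resolve_options(options_json=None):
    """Merge a JSON options string over the defaults (hypothetical helper)."""
    opts = dict(DEFAULTS)
    if options_json:
        supplied = json.loads(options_json)
        if not isinstance(supplied, dict):
            raise ValueError("options must be a JSON object")
        # Assumption: unknown keys are an error rather than silently ignored.
        unknown = set(supplied) - set(DEFAULTS)
        if unknown:
            raise ValueError(f"unknown option(s): {sorted(unknown)}")
        opts.update(supplied)
    if opts["mode"] not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    for key in ("rows_per_batch", "num_partitions"):
        value = opts[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{key} must be a positive integer")
    return opts


print(resolve_options('{"mode": "outdb", "rows_per_batch": 256}'))
```

Omitted keys fall back to their defaults, so the options string from the `count(*)` example in the summary resolves to `outdb` mode, 256 rows per batch, and a single partition.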
## Out of scope (separate follow-up PR)
- Lazy byte resolver: the function the byte-loading hook calls to fetch
zarr chunks on demand. Depends on #849.
- Cloud storage backends (zarrs_object_store for s3://, gs://, az://,
https://) and the async runtime story they need. Phase 1 errors clearly on
cloud schemes.
- Round-robin chunk partitioning past `num_partitions = 1`.
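For the last item, round-robin partitioning could look roughly like the sketch below. It is a hypothetical Python illustration of the general technique, not a design commitment for the follow-up PR; the function name `round_robin` and the use of plain integers as chunk identifiers are assumptions.

```python
def round_robin(chunks, num_partitions):
    """Distribute chunks across partitions in round-robin order.

    Hypothetical helper: chunk i goes to partition i % num_partitions.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    partitions = [[] for _ in range(num_partitions)]
    for position, chunk in enumerate(chunks):
        partitions[position % num_partitions].append(chunk)
    return partitions


# Seven chunks over three partitions; sizes differ by at most one.
print(round_robin(list(range(7)), 3))  # [[0, 3, 6], [1, 4], [2, 5]]
```

With `num_partitions = 1`, the current default, every chunk lands in the single partition, which matches the behaviour described for this PR.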
## Open questions for reviewers
1. Hard zarrs dependency in the sedona crate. Should I make the zarr crate
an optional dependency of the top-level crate?
2. Naming. `sd_read_zarr` matches the existing `sd_random_geometry` UDTF in
this repo. The wider DataFusion ecosystem also uses the `<format>_scan` pattern
(e.g. `delta_scan`). Happy to rename to `sd_zarr_scan` if reviewers prefer.