james-willis opened a new pull request, #858:
URL: https://github.com/apache/sedona-db/pull/858

   ## Summary
   
   Adds support for loading zarrs as rasters. Each chunk becomes one row, 
similar to how each tiff becomes one row.
   
   ```sql
   SELECT raster FROM sd_read_zarr('file:///path/to/datacube.zarr');
   
   SELECT count(*) FROM sd_read_zarr(
       'file:///path/to/datacube.zarr',
       '{"mode": "outdb", "rows_per_batch": 256}'
   );
   ```
   
   The result is returned as a single-column table with the column named 
`raster`. For now, chunks are loaded eagerly as part of the `sd_read_zarr` 
operation. When lazy loading support lands in #849, I will add lazy Zarr support 
and make that the default.
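   
   As a rough sanity check on the one-chunk-per-row model, the row count that `count(*)` reports should line up with the number of chunks in the grid. The sketch below is not code from this PR: it assumes a regular chunk grid where every chunk, including partial edge chunks, yields exactly one row, and the function names are hypothetical.
   
   ```rust
   /// Number of chunks along each dimension of a regular chunk grid,
   /// i.e. ceil(shape / chunk_shape) per dimension.
   /// Hypothetical helper for illustration; not part of sedona-raster-zarr.
   fn chunks_per_dim(shape: &[u64], chunk_shape: &[u64]) -> Option<Vec<u64>> {
       // Both slices must describe the same dimensions, and a zero-sized
       // chunk is not meaningful.
       if shape.len() != chunk_shape.len() || chunk_shape.iter().any(|&c| c == 0) {
           return None;
       }
       Some(
           shape
               .iter()
               .zip(chunk_shape)
               // Ceiling division without risking overflow on `s + c - 1`.
               .map(|(&s, &c)| s / c + u64::from(s % c != 0))
               .collect(),
       )
   }
   
   /// Total chunk count, which is the expected row count under the
   /// one-chunk-per-row assumption stated above.
   fn expected_rows(shape: &[u64], chunk_shape: &[u64]) -> Option<u64> {
       chunks_per_dim(shape, chunk_shape).map(|counts| counts.iter().product())
   }
   ```
   
   For example, a `[3, 512, 512]` array with `[1, 256, 256]` chunks gives `3 * 2 * 2 = 12` chunks, so 12 rows would be expected under these assumptions.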
   
   We built on the zarrs 0.23 crate rather than GDAL MultiDim because two GDAL 
gaps are independently disqualifying: Zarr v3 sharding is unreadable in GDAL ≤ 
3.12 (sharded Zarr is a critical cloud-native feature IMO), and vlen-utf8 
string coordinate variables are unreadable through GDAL MultiDim 
(climate/datacube Zarr uses these for band names).
   
   ## What's in this PR
   
   - New crate sedona-raster-zarr with `group_to_indb_rasters(uri)` and 
`group_to_outdb_rasters(uri)` entry points.
   - GeoZarr metadata parsing: proj:wkt2 / proj:projjson / proj:epsg (in that 
precedence) for CRS; spatial:transform + spatial:dims for affine + spatial axes.
   - Group-constraint validation at load time: shared chunk grid, chunk shape, 
and dim names across the group; named errors on the offending array.
   - `sd_read_zarr` UDTF registered in `SedonaContext::new_from_context`. JSON 
options: `mode` (default "indb"), `rows_per_batch` (default 1024), 
`num_partitions` (default 1).
   - Empty/partial projection support via ProjectionExec wrap, so `SELECT 
count(*) FROM sd_read_zarr(...)` works.
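   
   The CRS precedence in the GeoZarr bullet can be sketched as follows. This is an illustration only, not the parser in this PR: the `Crs` enum and `resolve_crs` function are hypothetical, and the attributes are assumed to be already extracted into a string map rather than parsed from JSON.
   
   ```rust
   use std::collections::HashMap;
   
   /// CRS as resolved from group attributes (hypothetical type for illustration).
   #[derive(Debug, PartialEq)]
   enum Crs {
       Wkt2(String),
       ProjJson(String),
       Epsg(u32),
   }
   
   /// Pick the CRS using the precedence proj:wkt2, then proj:projjson,
   /// then proj:epsg. `attrs` stands in for already-parsed metadata,
   /// with every value kept as a string (an assumption of this sketch).
   fn resolve_crs(attrs: &HashMap<String, String>) -> Option<Crs> {
       if let Some(wkt) = attrs.get("proj:wkt2") {
           return Some(Crs::Wkt2(wkt.clone()));
       }
       if let Some(json) = attrs.get("proj:projjson") {
           return Some(Crs::ProjJson(json.clone()));
       }
       // An EPSG code that does not parse as an integer is treated as
       // absent here; a real reader would more likely report an error.
       attrs
           .get("proj:epsg")
           .and_then(|code| code.trim().parse::<u32>().ok())
           .map(Crs::Epsg)
   }
   ```
   
   With this ordering, a group that carries both `proj:wkt2` and `proj:epsg` resolves to the WKT2 definition, and the EPSG code is only consulted when neither of the richer forms is present.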
   
   ## Out of scope (separate follow-up PR)
   
   - Lazy byte resolver: the function the byte-loading hook calls to fetch 
zarr chunks on demand. Depends on #849.
   - Cloud storage backends (zarrs_object_store for s3://, gs://, az://, 
https://) and the async runtime story they need. Phase 1 errors clearly on 
cloud schemes.
   - Round-robin chunk partitioning past `num_partitions = 1`.
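   
   For reviewers thinking ahead to the partitioning follow-up, round-robin assignment could look roughly like the sketch below. Nothing here is implemented in this PR; the function name and the choice to work on flat chunk indices are assumptions made for illustration.
   
   ```rust
   /// Distribute chunk indices 0..num_chunks across partitions in
   /// round-robin order: chunk i goes to partition i % num_partitions.
   /// Hypothetical helper; not part of this PR.
   fn round_robin(num_chunks: usize, num_partitions: usize) -> Vec<Vec<usize>> {
       // Treat 0 as 1 so the function always returns at least one partition.
       let n = num_partitions.max(1);
       let mut partitions = vec![Vec::new(); n];
       for chunk in 0..num_chunks {
           partitions[chunk % n].push(chunk);
       }
       partitions
   }
   ```
   
   With `num_partitions = 1` this degenerates to a single partition holding every chunk in order, which matches the current default.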
   
   ## Open questions for reviewers
   
   1. Hard zarrs dependency on the sedona crate. Should I make the zarr crate 
an optional dependency of the top-level crate?
   2. Naming. `sd_read_zarr` matches the existing `sd_random_geometry` UDTF in 
this repo. The wider DataFusion ecosystem also uses the `<format>_scan` pattern 
(e.g. `delta_scan`). Happy to rename to `sd_zarr_scan` if reviewers prefer.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.