paleolimbot commented on code in PR #249:
URL: https://github.com/apache/sedona-db/pull/249#discussion_r2469976727


##########
rust/sedona-schema/src/datatypes.rs:
##########
@@ -333,6 +365,1241 @@ fn deserialize_edges(edges: &Value) -> Result<Edges> {
     }
 }
 
+/// Schema for storing raster data in Apache Arrow format.
+/// Utilizing nested structs and lists to represent raster metadata and bands.
+#[derive(Debug, PartialEq, Clone)]
+pub struct RasterSchema;
+impl RasterSchema {
+    // Raster schema:
+    pub fn fields() -> Fields {
+        Fields::from(vec![
+            Field::new(column::METADATA, Self::metadata_type(), false),
+            Field::new(column::CRS, Self::crs_type(), true),
+            Field::new(column::BBOX, Self::bounding_box_type(), true),
+            Field::new(column::BANDS, Self::bands_type(), true),
+        ])
+    }
+
+    /// Raster metadata schema
+    pub fn metadata_type() -> DataType {
+        DataType::Struct(Fields::from(vec![
+            // Raster dimensions
+            Field::new(column::WIDTH, DataType::UInt64, false),
+            Field::new(column::HEIGHT, DataType::UInt64, false),
+            // Geospatial transformation parameters
+            Field::new(column::UPPERLEFT_X, DataType::Float64, false),
+            Field::new(column::UPPERLEFT_Y, DataType::Float64, false),
+            Field::new(column::SCALE_X, DataType::Float64, false),
+            Field::new(column::SCALE_Y, DataType::Float64, false),
+            Field::new(column::SKEW_X, DataType::Float64, false),
+            Field::new(column::SKEW_Y, DataType::Float64, false),
+        ]))
+    }
+
+    /// Bounding box schema
+    pub fn bounding_box_type() -> DataType {
+        DataType::Struct(Fields::from(vec![
+            Field::new(column::MIN_X, DataType::Float64, false),
+            Field::new(column::MIN_Y, DataType::Float64, false),
+            Field::new(column::MAX_X, DataType::Float64, false),
+            Field::new(column::MAX_Y, DataType::Float64, false),
+        ]))
+    }
+
+    /// Bands list schema
+    pub fn bands_type() -> DataType {
+        DataType::List(FieldRef::new(Field::new(
+            column::BAND,
+            Self::band_type(),
+            false,
+        )))
+    }
+
+    /// Individual band schema
+    pub fn band_type() -> DataType {
+        DataType::Struct(Fields::from(vec![
+            Field::new(column::METADATA, Self::band_metadata_type(), false),
+            Field::new(column::DATA, Self::band_data_type(), false),
+        ]))
+    }
+
+    /// Band metadata schema
+    pub fn band_metadata_type() -> DataType {
+        DataType::Struct(Fields::from(vec![
+            Field::new(column::NODATAVALUE, DataType::Binary, true), // Allow 
null nodata values
+            Field::new(column::STORAGE_TYPE, DataType::UInt32, false),
+            Field::new(column::DATATYPE, DataType::UInt32, false),
+            // OutDb reference fields - only used when storage_type == OutDbRef
+            Field::new(column::OUTDB_URL, DataType::Utf8, true),
+            Field::new(column::OUTDB_BAND_ID, DataType::UInt32, true),
+        ]))
+    }
+
+    /// Band data schema (single binary blob)
+    pub fn band_data_type() -> DataType {
+        DataType::Binary // consider switching to BinaryView
+    }
+
+    /// CRS schema to store json representation
+    pub fn crs_type() -> DataType {
+        DataType::Utf8 // TODO: Consider Utf8View
+    }

Review Comment:
   I think a Utf8View here is a good choice since this value will be frequently 
repeated, less than 12 bytes, or both!



##########
rust/sedona-schema/src/datatypes.rs:
##########
@@ -29,6 +39,7 @@ pub enum SedonaType {
     Arrow(DataType),
     Wkb(Edges, Crs),
     WkbView(Edges, Crs),
+    Raster(RasterSchema),

Review Comment:
   I think this can probably be unparameterized (i.e., `SedonaType::Raster`). 
My reading of this PR is that the underlying Arrow schema is a constant (so it 
can be cloned from RASTER_DATATYPE when requested by `storage_type()`).



##########
rust/sedona-schema/src/datatypes.rs:
##########
@@ -333,6 +365,1241 @@ fn deserialize_edges(edges: &Value) -> Result<Edges> {
     }
 }
 
+/// Schema for storing raster data in Apache Arrow format.
+/// Utilizing nested structs and lists to represent raster metadata and bands.
+#[derive(Debug, PartialEq, Clone)]
+pub struct RasterSchema;
+impl RasterSchema {
+    // Raster schema:
+    pub fn fields() -> Fields {
+        Fields::from(vec![
+            Field::new(column::METADATA, Self::metadata_type(), false),
+            Field::new(column::CRS, Self::crs_type(), true),
+            Field::new(column::BBOX, Self::bounding_box_type(), true),
+            Field::new(column::BANDS, Self::bands_type(), true),
+        ])
+    }
+
+    /// Raster metadata schema
+    pub fn metadata_type() -> DataType {
+        DataType::Struct(Fields::from(vec![
+            // Raster dimensions
+            Field::new(column::WIDTH, DataType::UInt64, false),
+            Field::new(column::HEIGHT, DataType::UInt64, false),
+            // Geospatial transformation parameters
+            Field::new(column::UPPERLEFT_X, DataType::Float64, false),
+            Field::new(column::UPPERLEFT_Y, DataType::Float64, false),
+            Field::new(column::SCALE_X, DataType::Float64, false),
+            Field::new(column::SCALE_Y, DataType::Float64, false),
+            Field::new(column::SKEW_X, DataType::Float64, false),
+            Field::new(column::SKEW_Y, DataType::Float64, false),
+        ]))
+    }

Review Comment:
   A potential future consideration, that it might be just as fast to do 
something like:
   
   ```rust
   #[derive(Serialize, Deserialize)
   struct RasterMetadata {
     shape: [u64; 2],
     upper_left: [f64; 2],
     geo_transform: [f64; 6],
   }
   ```
   
   ...and use `serde_json` to read and write from JSON. (i.e., the `metadata` 
field would just be `Utf8`).
   
   You've done a great job abstracting this so that nobody ever has to manually 
create or read this and so it should be easy to experiment with changing it in 
the future 🙂 



##########
rust/sedona-schema/src/datatypes.rs:
##########
@@ -333,6 +365,1241 @@ fn deserialize_edges(edges: &Value) -> Result<Edges> {
     }
 }
 
+/// Schema for storing raster data in Apache Arrow format.
+/// Utilizing nested structs and lists to represent raster metadata and bands.
+#[derive(Debug, PartialEq, Clone)]
+pub struct RasterSchema;
+impl RasterSchema {
+    // Raster schema:
+    pub fn fields() -> Fields {
+        Fields::from(vec![
+            Field::new(column::METADATA, Self::metadata_type(), false),
+            Field::new(column::CRS, Self::crs_type(), true),
+            Field::new(column::BBOX, Self::bounding_box_type(), true),
+            Field::new(column::BANDS, Self::bands_type(), true),
+        ])
+    }
+
+    /// Raster metadata schema
+    pub fn metadata_type() -> DataType {
+        DataType::Struct(Fields::from(vec![
+            // Raster dimensions
+            Field::new(column::WIDTH, DataType::UInt64, false),
+            Field::new(column::HEIGHT, DataType::UInt64, false),
+            // Geospatial transformation parameters
+            Field::new(column::UPPERLEFT_X, DataType::Float64, false),
+            Field::new(column::UPPERLEFT_Y, DataType::Float64, false),
+            Field::new(column::SCALE_X, DataType::Float64, false),
+            Field::new(column::SCALE_Y, DataType::Float64, false),
+            Field::new(column::SKEW_X, DataType::Float64, false),
+            Field::new(column::SKEW_Y, DataType::Float64, false),
+        ]))
+    }
+
+    /// Bounding box schema
+    pub fn bounding_box_type() -> DataType {
+        DataType::Struct(Fields::from(vec![
+            Field::new(column::MIN_X, DataType::Float64, false),
+            Field::new(column::MIN_Y, DataType::Float64, false),
+            Field::new(column::MAX_X, DataType::Float64, false),
+            Field::new(column::MAX_Y, DataType::Float64, false),
+        ]))
+    }
+
+    /// Bands list schema
+    pub fn bands_type() -> DataType {
+        DataType::List(FieldRef::new(Field::new(
+            column::BAND,
+            Self::band_type(),
+            false,
+        )))
+    }
+
+    /// Individual band schema
+    pub fn band_type() -> DataType {
+        DataType::Struct(Fields::from(vec![
+            Field::new(column::METADATA, Self::band_metadata_type(), false),
+            Field::new(column::DATA, Self::band_data_type(), false),
+        ]))
+    }
+
+    /// Band metadata schema
+    pub fn band_metadata_type() -> DataType {
+        DataType::Struct(Fields::from(vec![
+            Field::new(column::NODATAVALUE, DataType::Binary, true), // Allow 
null nodata values
+            Field::new(column::STORAGE_TYPE, DataType::UInt32, false),
+            Field::new(column::DATATYPE, DataType::UInt32, false),
+            // OutDb reference fields - only used when storage_type == OutDbRef
+            Field::new(column::OUTDB_URL, DataType::Utf8, true),
+            Field::new(column::OUTDB_BAND_ID, DataType::UInt32, true),
+        ]))

Review Comment:
   With JSON storage this could also collapse to:
   
   ```rust
   #[derive(Serialize, Deserialize)]
   struct BandMetadata {
     nodata_value: Vec<u8>,
     storage_type: StorageType,
     data_type: BandDataType,
     #[serde(skip_serializing_if = None)]
     outdb_url: Option<String>,
     #[serde(skip_serializing_if = None)]
     outdb_band_id: Option<String>
   }
   ```
   
   (Again, this is abstracted nicely so we don't have to consider this now!)



##########
rust/sedona-schema/src/datatypes.rs:
##########
@@ -333,6 +365,1241 @@ fn deserialize_edges(edges: &Value) -> Result<Edges> {
     }
 }
 
+/// Schema for storing raster data in Apache Arrow format.
+/// Utilizing nested structs and lists to represent raster metadata and bands.
+#[derive(Debug, PartialEq, Clone)]
+pub struct RasterSchema;
+impl RasterSchema {

Review Comment:
   I think it might be nice to have this live in a dedicated 
`sedona-schema/src/raster.rs` (arguably we should move the Wkb types into their 
own file as well in some future 😬 )



##########
rust/sedona-functions/src/executor.rs:
##########
@@ -75,6 +75,116 @@ pub struct GenericExecutor<'a, 'b, Factory0, Factory1> {
 /// Alias for an executor that iterates over geometries as [Wkb]
 pub type WkbExecutor<'a, 'b> = GenericExecutor<'a, 'b, WkbGeometryFactory, 
WkbGeometryFactory>;
 
+/// Helper for writing raster kernel implementations
+///
+/// The [RasterExecutor] provides a simplified interface for executing 
functions
+/// on raster arrays, handling the common pattern of downcasting to 
StructArray,
+/// creating raster iterators, and handling null values.
+pub struct RasterExecutor<'a, 'b> {
+    pub arg_types: &'a [SedonaType],
+    pub args: &'b [ColumnarValue],
+    num_iterations: usize,
+}

Review Comment:
   I think having a `sedona-raster-functions` crate would make sense! 
(sedona-functions is already a little crowded and might benefit from being 
split even in its current form)



##########
rust/sedona-gdal/src/rs_value.rs:
##########
@@ -0,0 +1,443 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+use std::sync::Arc;
+
+use crate::dataset::outdb_dataset;
+use arrow_array::builder::Float64Builder;
+use arrow_schema::{ArrowError, DataType};
+use datafusion_common::{error::Result, scalar::ScalarValue};
+use datafusion_expr::ColumnarValue;
+use sedona_expr::scalar_udf::{ScalarKernelRef, SedonaScalarKernel};
+use sedona_functions::executor::RasterExecutor;
+use sedona_raster::datatype_functions::{bytes_per_pixel, read_pixel_value};
+use sedona_schema::datatypes::{BandMetadataRef, BandRef, RasterRef, 
SedonaType, StorageType};
+
+/// RS_Value() implementation
+pub fn rs_value_impl() -> ScalarKernelRef {
+    Arc::new(RSValue {})
+}

Review Comment:
   This is probably something we want to be an async scalar function (which 
we'd need to implement a Sedona wrapper for to get the kernel dispatch thing 
working): 
https://datafusion.apache.org/library-user-guide/functions/adding-udfs.html#adding-an-async-scalar-udf
 . That would provide a mechanism for doing the IO required in parallel and a 
way to tell DataFusion that operating in small batches is a good idea.



##########
rust/sedona-schema/src/datatypes.rs:
##########
@@ -333,6 +365,1241 @@ fn deserialize_edges(edges: &Value) -> Result<Edges> {
     }
 }
 
+/// Schema for storing raster data in Apache Arrow format.
+/// Utilizing nested structs and lists to represent raster metadata and bands.
+#[derive(Debug, PartialEq, Clone)]
+pub struct RasterSchema;
+impl RasterSchema {
+    // Raster schema:
+    pub fn fields() -> Fields {
+        Fields::from(vec![
+            Field::new(column::METADATA, Self::metadata_type(), false),
+            Field::new(column::CRS, Self::crs_type(), true),
+            Field::new(column::BBOX, Self::bounding_box_type(), true),
+            Field::new(column::BANDS, Self::bands_type(), true),
+        ])
+    }
+
+    /// Raster metadata schema
+    pub fn metadata_type() -> DataType {
+        DataType::Struct(Fields::from(vec![
+            // Raster dimensions
+            Field::new(column::WIDTH, DataType::UInt64, false),
+            Field::new(column::HEIGHT, DataType::UInt64, false),
+            // Geospatial transformation parameters
+            Field::new(column::UPPERLEFT_X, DataType::Float64, false),
+            Field::new(column::UPPERLEFT_Y, DataType::Float64, false),
+            Field::new(column::SCALE_X, DataType::Float64, false),
+            Field::new(column::SCALE_Y, DataType::Float64, false),
+            Field::new(column::SKEW_X, DataType::Float64, false),
+            Field::new(column::SKEW_Y, DataType::Float64, false),
+        ]))
+    }
+
+    /// Bounding box schema
+    pub fn bounding_box_type() -> DataType {
+        DataType::Struct(Fields::from(vec![
+            Field::new(column::MIN_X, DataType::Float64, false),
+            Field::new(column::MIN_Y, DataType::Float64, false),
+            Field::new(column::MAX_X, DataType::Float64, false),
+            Field::new(column::MAX_Y, DataType::Float64, false),
+        ]))
+    }
+
+    /// Bands list schema
+    pub fn bands_type() -> DataType {
+        DataType::List(FieldRef::new(Field::new(
+            column::BAND,
+            Self::band_type(),
+            false,
+        )))
+    }
+
+    /// Individual band schema
+    pub fn band_type() -> DataType {
+        DataType::Struct(Fields::from(vec![
+            Field::new(column::METADATA, Self::band_metadata_type(), false),
+            Field::new(column::DATA, Self::band_data_type(), false),
+        ]))
+    }
+
+    /// Band metadata schema
+    pub fn band_metadata_type() -> DataType {
+        DataType::Struct(Fields::from(vec![
+            Field::new(column::NODATAVALUE, DataType::Binary, true), // Allow 
null nodata values
+            Field::new(column::STORAGE_TYPE, DataType::UInt32, false),
+            Field::new(column::DATATYPE, DataType::UInt32, false),
+            // OutDb reference fields - only used when storage_type == OutDbRef
+            Field::new(column::OUTDB_URL, DataType::Utf8, true),
+            Field::new(column::OUTDB_BAND_ID, DataType::UInt32, true),
+        ]))
+    }
+
+    /// Band data schema (single binary blob)
+    pub fn band_data_type() -> DataType {
+        DataType::Binary // consider switching to BinaryView
+    }
+
+    /// CRS schema to store json representation
+    pub fn crs_type() -> DataType {
+        DataType::Utf8 // TODO: Consider Utf8View
+    }
+}
+
+#[repr(u16)]
+#[derive(Clone, Debug, PartialEq, Eq)]
+pub enum BandDataType {
+    UInt8 = 0,
+    UInt16 = 1,
+    Int16 = 2,
+    UInt32 = 3,
+    Int32 = 4,
+    Float32 = 5,
+    Float64 = 6,
+    // Consider support for complex types for scientific data
+}
+
+/// Storage strategy for raster band data within Apache Arrow arrays.
+///
+/// This enum defines how raster data is physically stored and accessed:
+///
+/// **InDb**: Raster data is embedded directly in the Arrow array as binary 
blobs.
+///   - Self-contained, no external dependencies, fast access for small-medium 
rasters
+///   - Increases Arrow array size, memory usage grows and copy times increase 
with raster size
+///   - Best for: Tiles, thumbnails, processed results, small rasters (<10MB 
per band)
+///
+/// **OutDbRef**: Raster data is stored externally with references in the 
Arrow array.
+///   - Keeps Arrow arrays lightweight, supports massive rasters, enables lazy 
loading
+///   - Requires external storage management, potential for broken references
+///   - Best for: Large satellite imagery, time series data, cloud-native 
workflows
+///   - Supported backends: S3, GCS, Azure Blob, local filesystem, HTTP 
endpoints
+#[repr(u16)]
+#[derive(Clone, Debug, PartialEq, Eq)]
+pub enum StorageType {
+    InDb = 0,
+    OutDbRef = 1,
+}
+
+/// Builder for constructing raster arrays with zero-copy band data writing
+pub struct RasterBuilder {
+    main_builder: StructBuilder,
+}

Review Comment:
   Nice! I think your sedona-raster crate would be a good home for this one. I 
know our existing sedona-geometry doesn't depend on Arrow but I think that's 
fine for sedona-raster (we can move these things around if there's good reason 
to do so later on).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to