mhilton commented on code in PR #8940:
URL: https://github.com/apache/arrow-rs/pull/8940#discussion_r2589112859


##########
parquet-variant-compute/src/shred_variant.rs:
##########
@@ -348,6 +349,139 @@ impl<'a> VariantToShreddedObjectVariantRowBuilder<'a> {
     }
 }
 
+fn split_variant_path(path: &str) -> Vec<String> {
+    path.split('.')
+        .filter(|segment| !segment.is_empty())
+        .map(|segment| segment.to_string())
+        .collect()
+}
+
+/// Builder for constructing a variant shredding schema.
+///
+/// The builder pattern makes it easy to incrementally define which fields
+/// should be shredded and with what types.
+///
+/// # Example
+///
+/// ```
+/// use arrow::datatypes::DataType;
+/// use parquet_variant_compute::ShredTypeBuilder;
+///
+/// // Define the shredding schema using the builder
+/// let shredding_type = ShredTypeBuilder::default()
+///     .with_path("time", &DataType::Int64)
+///     .with_path("hostname", &DataType::Utf8)
+///     .with_path("metrics.cpu", &DataType::Float64)
+///     .with_path("metrics.memory", &DataType::Float64)
+///     .build();
+///
+/// // The shredding_type can now be passed to shred_variant:
+/// // let shredded = shred_variant(&input, &shredding_type)?;
+/// ```
+#[derive(Default, Clone)]
+pub struct ShredTypeBuilder {
+    root: VariantSchemaNode,
+}
+
+impl ShredTypeBuilder {
+    /// Create a new empty schema builder.
+    pub fn new() -> Self {
+        Self::default()
+    }
+
+    /// Insert a typed path into the schema.
+    ///
+    /// The path uses dot notation to specify nested fields.
+    /// For example, "a.b.c" will create a nested structure.
+    ///
+    /// # Arguments
+    ///
+    /// * `path` - The dot-separated path (e.g., "user.name" or "metrics.cpu")
+    /// * `data_type` - The Arrow data type for this field
+    pub fn with_path(mut self, path: &str, data_type: &DataType) -> Self {
+        let segments = split_variant_path(path);
+        self.root.insert_path(&segments, data_type);
+        self
+    }
+
+    /// Build the final [`DataType`].
+    pub fn build(self) -> DataType {
+        let shredding_type = self.root.to_shredding_type();
+        match shredding_type {
+            Some(shredding_type) => shredding_type,
+            None => DataType::Null,
+        }
+    }
+}
+
+/// Internal tree node structure for building variant schemas.
+#[derive(Clone)]
+enum VariantSchemaNode {

Review Comment:
   It'd be nice if this type supported an Array type as well, although the 
path syntax would need to be extended a little. Possibly you could use `.0` to 
indicate an array?
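
   One possible shape for that extended path syntax, sketched in plain Rust 
without the arrow crates: numeric segments such as `.0` parse as array 
indices, while everything else stays a field name. The `PathSegment` enum and 
`parse_variant_path` function below are hypothetical, not part of this PR.

```rust
// Hypothetical sketch: extend dot-path parsing so numeric segments
// denote array positions, e.g. "metrics.0.cpu".
#[derive(Debug, PartialEq)]
enum PathSegment {
    Field(String),
    Index(usize),
}

fn parse_variant_path(path: &str) -> Vec<PathSegment> {
    path.split('.')
        .filter(|segment| !segment.is_empty())
        .map(|segment| match segment.parse::<usize>() {
            // A segment that parses as an integer is taken as an array index.
            Ok(i) => PathSegment::Index(i),
            // Anything else remains an object field name.
            Err(_) => PathSegment::Field(segment.to_string()),
        })
        .collect()
}
```

   With this scheme `"metrics.0.cpu"` would address the `cpu` field of the 
first element of the `metrics` array. One open question: a key that is 
literally a number would become ambiguous under this syntax.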



##########
parquet-variant-compute/src/shred_variant.rs:
##########
@@ -348,6 +349,139 @@ impl<'a> VariantToShreddedObjectVariantRowBuilder<'a> {
     }
 }
 
+fn split_variant_path(path: &str) -> Vec<String> {
+    path.split('.')
+        .filter(|segment| !segment.is_empty())
+        .map(|segment| segment.to_string())
+        .collect()
+}
+
+/// Builder for constructing a variant shredding schema.
+///
+/// The builder pattern makes it easy to incrementally define which fields
+/// should be shredded and with what types.
+///
+/// # Example
+///
+/// ```
+/// use arrow::datatypes::DataType;
+/// use parquet_variant_compute::ShredTypeBuilder;
+///
+/// // Define the shredding schema using the builder
+/// let shredding_type = ShredTypeBuilder::default()
+///     .with_path("time", &DataType::Int64)
+///     .with_path("hostname", &DataType::Utf8)
+///     .with_path("metrics.cpu", &DataType::Float64)
+///     .with_path("metrics.memory", &DataType::Float64)
+///     .build();
+///
+/// // The shredding_type can now be passed to shred_variant:
+/// // let shredded = shred_variant(&input, &shredding_type)?;
+/// ```
+#[derive(Default, Clone)]
+pub struct ShredTypeBuilder {
+    root: VariantSchemaNode,
+}
+
+impl ShredTypeBuilder {
+    /// Create a new empty schema builder.
+    pub fn new() -> Self {
+        Self::default()
+    }
+
+    /// Insert a typed path into the schema.
+    ///
+    /// The path uses dot notation to specify nested fields.
+    /// For example, "a.b.c" will create a nested structure.
+    ///
+    /// # Arguments
+    ///
+    /// * `path` - The dot-separated path (e.g., "user.name" or "metrics.cpu")
+    /// * `data_type` - The Arrow data type for this field
+    pub fn with_path(mut self, path: &str, data_type: &DataType) -> Self {

Review Comment:
   Could there also be a version of this function that takes a `&VariantPath` 
as the first input? That way, if one has key names containing a `.`, it would 
still be possible to use this builder.
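
   A minimal sketch of how a segment-based entry point could sit alongside 
`with_path`. The `PathBuilder` struct and `with_path_segments` method below 
are hypothetical stand-ins (the real builder stores a `VariantSchemaNode` 
tree); paths are recorded as plain `Vec<Vec<String>>` here so the idea is 
runnable without the arrow crates.

```rust
// Hypothetical sketch: accept pre-split path segments so a key name
// that itself contains '.' is never broken apart.
#[derive(Default)]
struct PathBuilder {
    paths: Vec<Vec<String>>,
}

impl PathBuilder {
    /// String form: splits on '.', so "a.b" addresses field `b` inside `a`.
    fn with_path(mut self, path: &str) -> Self {
        let segments: Vec<String> = path
            .split('.')
            .filter(|segment| !segment.is_empty())
            .map(str::to_string)
            .collect();
        self.paths.push(segments);
        self
    }

    /// Segment form: the caller supplies the path already split, so a
    /// single segment may contain a literal '.'.
    fn with_path_segments(mut self, segments: &[&str]) -> Self {
        self.paths
            .push(segments.iter().map(|s| s.to_string()).collect());
        self
    }
}
```

   Here `.with_path_segments(&["host.name", "cpu"])` addresses the `cpu` 
field inside a top-level key literally named `host.name`, which the 
string form cannot express.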



##########
parquet-variant-compute/src/shred_variant.rs:
##########
@@ -348,6 +349,139 @@ impl<'a> VariantToShreddedObjectVariantRowBuilder<'a> {
     }
 }
 
+fn split_variant_path(path: &str) -> Vec<String> {
+    path.split('.')
+        .filter(|segment| !segment.is_empty())
+        .map(|segment| segment.to_string())
+        .collect()
+}
+
+/// Builder for constructing a variant shredding schema.
+///
+/// The builder pattern makes it easy to incrementally define which fields
+/// should be shredded and with what types.
+///
+/// # Example
+///
+/// ```
+/// use arrow::datatypes::DataType;
+/// use parquet_variant_compute::ShredTypeBuilder;
+///
+/// // Define the shredding schema using the builder
+/// let shredding_type = ShredTypeBuilder::default()
+///     .with_path("time", &DataType::Int64)

Review Comment:
   It seems to me that this would be a good place to illustrate that one can 
use more complex types such as `DataType::Timestamp(TimeUnit::Nanosecond, 
Some(Arc::from("UTC")))` (this particular type might not be available until 
#8950 merges).



##########
parquet-variant-compute/src/shred_variant.rs:
##########
@@ -348,6 +349,139 @@ impl<'a> VariantToShreddedObjectVariantRowBuilder<'a> {
     }
 }
 
+fn split_variant_path(path: &str) -> Vec<String> {
+    path.split('.')
+        .filter(|segment| !segment.is_empty())
+        .map(|segment| segment.to_string())
+        .collect()
+}
+
+/// Builder for constructing a variant shredding schema.
+///
+/// The builder pattern makes it easy to incrementally define which fields
+/// should be shredded and with what types.
+///
+/// # Example
+///
+/// ```
+/// use arrow::datatypes::DataType;
+/// use parquet_variant_compute::ShredTypeBuilder;
+///
+/// // Define the shredding schema using the builder
+/// let shredding_type = ShredTypeBuilder::default()
+///     .with_path("time", &DataType::Int64)
+///     .with_path("hostname", &DataType::Utf8)
+///     .with_path("metrics.cpu", &DataType::Float64)
+///     .with_path("metrics.memory", &DataType::Float64)
+///     .build();
+///
+/// // The shredding_type can now be passed to shred_variant:
+/// // let shredded = shred_variant(&input, &shredding_type)?;
+/// ```
+#[derive(Default, Clone)]
+pub struct ShredTypeBuilder {
+    root: VariantSchemaNode,
+}
+
+impl ShredTypeBuilder {
+    /// Create a new empty schema builder.
+    pub fn new() -> Self {
+        Self::default()
+    }
+
+    /// Insert a typed path into the schema.
+    ///
+    /// The path uses dot notation to specify nested fields.
+    /// For example, "a.b.c" will create a nested structure.
+    ///
+    /// # Arguments
+    ///
+    /// * `path` - The dot-separated path (e.g., "user.name" or "metrics.cpu")
+    /// * `data_type` - The Arrow data type for this field
+    pub fn with_path(mut self, path: &str, data_type: &DataType) -> Self {

Review Comment:
   Out of interest, if one were to have a version using a `FieldRef`, what 
should the behaviour be if the name in the field doesn't match the path 
string?
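
   One plausible answer, sketched without the arrow crates: treat a mismatch 
as an error rather than silently preferring either name. The 
`check_field_matches_path` helper below is hypothetical, shown only to make 
the design question concrete.

```rust
// Hypothetical sketch: if a FieldRef-taking variant were added, one
// option is to reject a field whose name differs from the final path
// segment, so the caller's intent is never silently overridden.
fn check_field_matches_path(path_segments: &[&str], field_name: &str) -> Result<(), String> {
    match path_segments.last() {
        Some(last) if *last == field_name => Ok(()),
        Some(last) => Err(format!(
            "field name '{field_name}' does not match final path segment '{last}'"
        )),
        None => Err("empty path".to_string()),
    }
}
```

   The alternatives would be to let the `FieldRef` name win, or to let the 
path win and ignore the field's name; failing loudly avoids having to pick 
either precedence rule.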



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
