scovich commented on code in PR #8600:
URL: https://github.com/apache/arrow-rs/pull/8600#discussion_r2429852073


##########
parquet-variant-compute/src/variant_to_arrow.rs:
##########
@@ -52,6 +53,12 @@ pub(crate) enum PrimitiveVariantToArrowRowBuilder<'a> {
     TimestampNano(VariantToTimestampArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
     TimestampNanoNtz(VariantToTimestampNtzArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
     Date(VariantToPrimitiveArrowRowBuilder<'a, datatypes::Date32Type>),
+    StringView(VariantToUtf8ViewArrowBuilder<'a>),

Review Comment:
   I don't see any meaningful call sites that pass a data capacity -- only some 
unit tests.
   
   Ultimately, `variant_get` will call `make_variant_to_arrow_row_builder`, and 
I don't think that code has any way to predict what the correct data capacity 
might be? How could one even define "correct" when a single value would be 
applied to each of potentially many string row builders that will be created, 
when each of those builders could see a completely different distribution of 
string sizes and null values?
   
   This is very different from the row capacity value, which _IS_ precisely 
known and applies equally to all builders `variant_get` might need to create.
   
   Also -- these capacities are just pre-allocation hints; passing too large a 
hint temporarily wastes a bit of memory, and passing too small a hint just 
means one or more internal reallocations.
   
   I would vote to just choose a reasonable default "average string size" and 
multiply that by the row count to obtain a data capacity hint when needed. 
   
   TBD whether that average string size should be a parameter that originates 
with the caller of `variant_get` and gets plumbed all the way through -- but 
that seems like a really invasive API change for very little benefit. Seems 
like a simple `const` would be much better.
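   
   To make that concrete, here is a minimal sketch of what such a `const`-based hint could look like (the names `AVERAGE_STRING_LENGTH` and `data_capacity_hint` are hypothetical, not the PR's actual API):
   
   ```rust
   // Hypothetical sketch: derive a data-capacity hint from the precisely
   // known row count and an assumed average string size.
   const AVERAGE_STRING_LENGTH: usize = 16; // assumed default, tune as needed
   
   fn data_capacity_hint(row_count: usize) -> usize {
       // Over-guessing briefly wastes a little memory; under-guessing just
       // costs a few internal reallocations, so a rough constant is fine.
       row_count.saturating_mul(AVERAGE_STRING_LENGTH)
   }
   
   fn main() {
       println!("{}", data_capacity_hint(1024)); // prints 16384
   }
   ```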



##########
parquet-variant-compute/src/variant_to_arrow.rs:
##########
@@ -133,6 +142,33 @@ impl<'a> PrimitiveVariantToArrowRowBuilder<'a> {
             TimestampNano(b) => b.finish(),
             TimestampNanoNtz(b) => b.finish(),
             Date(b) => b.finish(),
+            StringView(b) => b.finish(),
+        }
+    }
+}
+
+impl<'a> StringVariantToArrowRowBuilder<'a> {
+    pub fn append_null(&mut self) -> Result<()> {
+        use StringVariantToArrowRowBuilder::*;
+        match self {
+            Utf8(b) => b.append_null(),
+            LargeUtf8(b) => b.append_null(),
+        }
+    }

Review Comment:
   I don't see any string-specific logic that would merit a nested enum like 
this? 
   Can we make this builder generic and use it in two new variants of the 
top-level enum?
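   
   For illustration, a toy, arrow-free sketch of the shape I have in mind -- all type names here are stand-ins for the real builders, not the PR's actual types:
   
   ```rust
   use std::marker::PhantomData;
   
   // Toy stand-in for the real string builder types; illustrative names only.
   trait StringBackend {
       fn type_name() -> &'static str;
   }
   
   struct Utf8Backend;
   struct LargeUtf8Backend;
   
   impl StringBackend for Utf8Backend {
       fn type_name() -> &'static str { "Utf8" }
   }
   impl StringBackend for LargeUtf8Backend {
       fn type_name() -> &'static str { "LargeUtf8" }
   }
   
   // One generic row builder -- no string-specific nested enum needed.
   struct VariantToStringRowBuilder<B: StringBackend> {
       appended: usize,
       _backend: PhantomData<B>,
   }
   
   impl<B: StringBackend> VariantToStringRowBuilder<B> {
       fn new() -> Self {
           Self { appended: 0, _backend: PhantomData }
       }
       fn append_null(&mut self) {
           self.appended += 1;
       }
   }
   
   // The top-level enum simply gains one variant per offset width.
   enum PrimitiveRowBuilder {
       Utf8(VariantToStringRowBuilder<Utf8Backend>),
       LargeUtf8(VariantToStringRowBuilder<LargeUtf8Backend>),
   }
   
   impl PrimitiveRowBuilder {
       fn append_null(&mut self) {
           match self {
               PrimitiveRowBuilder::Utf8(b) => b.append_null(),
               PrimitiveRowBuilder::LargeUtf8(b) => b.append_null(),
           }
       }
   }
   
   fn main() {
       let mut builder = PrimitiveRowBuilder::Utf8(VariantToStringRowBuilder::new());
       builder.append_null();
       println!("{}", Utf8Backend::type_name()); // prints Utf8
   }
   ```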



##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -763,6 +764,13 @@ mod test {
         BooleanArray::from(vec![Some(true), Some(false), Some(true)])
     );
 
+    perfectly_shredded_to_arrow_primitive_test!(
+        get_variant_perfectly_shredded_utf8_as_utf8,
+        DataType::Utf8,

Review Comment:
   That would be from the `VariantArray` constructor, which invokes this code:
   ```rust
   fn canonicalize_and_verify_data_type(
       data_type: &DataType,
   ) -> Result<Cow<'_, DataType>, ArrowError> {
         ...
       let new_data_type = match data_type {
             ...
        // We can _possibly_ allow (some of) these some day?
        LargeBinary | LargeUtf8 | Utf8View | ListView(_) | LargeList(_) | LargeListView(_) => {
            fail!()
        }
   ```
   
   I originally added that code because I was not confident I knew what the 
correct behavior should be. The [shredding 
spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types)
 says:
   > Shredded values must use the following Parquet types:
   >
   > | Variant Type | Parquet Physical Type   | Parquet Logical Type |
   > |--------------|-------------------------|----------------------|
   > | ...          |                         |                      |
   > | binary       | BINARY                  |                      |
   > | string       | BINARY                  | STRING               |
   > | ...          |                         |                      |
   > | array        | GROUP; see Arrays below | LIST                 |
   
   But I'm _pretty_ sure that doesn't need to constrain the use of `DataType::Utf8` vs. `DataType::LargeUtf8` vs. `DataType::Utf8View`? (Similar story for the various in-memory layouts of list and binary values.)
   
   A similar dilemma is that the `metadata` column is supposed to be parquet 
BINARY type, but arrow-parquet produces `BinaryViewArray` by default. Right now 
we replace `DataType::Binary` with `DataType::BinaryView` and force a cast as 
needed. 
   
   If we think the shredding spec forbids `LargeUtf8` or `Utf8View` then we 
probably need to cast binary views back to normal binary as well.
   
   If we don't think the shredding spec forbids those types, then we should 
probably support `metadata: LargeBinaryArray` (tho the narrowing cast to 
`BinaryArray` might fail if the offsets really don't fit in 32 bits).
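   
   A tiny sketch of that failure mode, with `narrow_offset` as a hypothetical stand-in for the per-offset conversion the cast would perform:
   
   ```rust
   // Why the narrowing cast can fail: LargeBinary/LargeUtf8 carry i64 offsets,
   // and any offset beyond i32::MAX has no 32-bit representation.
   fn narrow_offset(offset: i64) -> Option<i32> {
       i32::try_from(offset).ok()
   }
   
   fn main() {
       println!("{:?}", narrow_offset(42));                      // Some(42)
       println!("{:?}", narrow_offset(i64::from(i32::MAX) + 1)); // None
   }
   ```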
   
   @alamb @cashmand, any advice here?



##########
parquet-variant-compute/src/variant_to_arrow.rs:
##########
@@ -401,6 +481,27 @@ macro_rules! define_variant_to_primitive_builder {
     }
 }
 
+define_variant_to_primitive_builder!(
+    struct VariantToUtf8ArrowRowBuilder<'a>
+    |item_capacity, data_capacity: usize| -> StringBuilder { StringBuilder::with_capacity(item_capacity, data_capacity) },
+    |value| value.as_string(),
+    type_name: "Utf8"
+);
+
+define_variant_to_primitive_builder!(
+    struct VariantToLargeUtf8ArrowBuilder<'a>
+    |item_capacity, data_capacity: usize| -> LargeStringBuilder { LargeStringBuilder::with_capacity(item_capacity, data_capacity) },
+    |value| value.as_string(),
+    type_name: "LargeUtf8"
+);
+
+define_variant_to_primitive_builder!(
+    struct VariantToUtf8ViewArrowBuilder<'a>
+    |capacity| -> StringViewBuilder { StringViewBuilder::with_capacity(capacity) },
+    |value| value.as_string(),
+    type_name: "Utf8View"
+);

Review Comment:
   Check out the `ListLikeArray` trait in arrow_to_variant.rs -- I suspect a `StringLikeArrayBuilder` trait would be very helpful here, because the "shape" of all the string arrays is very similar (even though the specific array types and possibly some method names might differ).
   
   ```suggestion
   define_variant_to_primitive_builder!(
       struct VariantToStringArrowBuilder<'a, B: StringLikeArrayBuilder>
       |capacity| -> B { B::with_capacity(capacity) },
       |value| value.as_string(),
       type_name: B::type_name()
   );
   ```
   
   where
   ```rust
   trait StringLikeArrayBuilder: ArrayBuilder {
       fn type_name() -> &'static str;
       fn with_capacity(capacity: usize) -> Self;
       fn append_value(&mut self, value: &str);
       fn append_null(&mut self);
   }
   
   impl StringLikeArrayBuilder for StringViewBuilder {
         ...
   }
   
   impl<O: OffsetSizeTrait> StringLikeArrayBuilder for GenericStringBuilder<O> {
         ...
       fn with_capacity(capacity: usize) -> Self {
           Self::with_capacity(capacity, capacity*AVERAGE_STRING_LENGTH)
       }
         ...
   }
   ```
   As noted in a different comment, we don't have any meaningful way to predict 
the needed data_capacity of a string builder, so IMO `<GenericStringBuilder as 
StringLikeArrayBuilder>::with_capacity` should just scale the item capacity by 
some reasonable guess at the average string size, and call that the data 
capacity.



##########
parquet-variant-compute/src/variant_to_arrow.rs:
##########
@@ -52,6 +53,12 @@ pub(crate) enum PrimitiveVariantToArrowRowBuilder<'a> {
     TimestampNano(VariantToTimestampArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
     TimestampNanoNtz(VariantToTimestampNtzArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
     Date(VariantToPrimitiveArrowRowBuilder<'a, datatypes::Date32Type>),
+    StringView(VariantToUtf8ViewArrowBuilder<'a>),

Review Comment:
   A big benefit of the simpler approach to data capacity: All the string 
builders are, in fact, primitive builders (see the macro invocations below) -- 
so we can just add three new enum variants to the primitive row builder enum 
and call it done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
