scovich commented on code in PR #8521:
URL: https://github.com/apache/arrow-rs/pull/8521#discussion_r2396006429


##########
parquet-variant-compute/src/arrow_to_variant.rs:
##########
@@ -38,6 +38,98 @@ use parquet_variant::{
 use std::collections::HashMap;
 use std::ops::Range;
 
+// ============================================================================
+// Shared traits and helpers for Arrow-to-Variant conversion
+// ============================================================================
+
+/// Zero-cost trait for converting Arrow array values to Variant
+pub(crate) trait ArrowToVariant: Array {
+    fn append_to_variant_builder(
+        &self,
+        builder: &mut impl VariantBuilderExt,
+        index: usize,
+    ) -> Result<(), ArrowError>;
+}
+
+/// Macro to define ArrowToVariant implementations with optional value 
transformation
+macro_rules! define_arrow_to_variant {

Review Comment:
   > There seems to be two major pieces of functionality
   > 
   >     1. Converting elements of Arrow Arrays to the corresponding Variant 
elements
   > 
   >     2. Converting Variant arrays back to Arrow Arrays
   > 
   
   I agree this is the intuitive view, and arrow_to_variant and unshred_variant 
both fall loosely in category 1/.
   
   > It seems to me like `arrow_to_variant.rs` and `unshred_variant.rs` still 
have non trivial overlap as you mention. What I was hoping was that we could 
somehow use one set of traits / structs for both those operations which would 
mean we would get `unshred` support "for free" for other types like 
`UnionArray`s
   > 
   > That was the dream at anyways.
   
   So this is tricky -- shredded variant `typed_value` columns must be one of 
the supported variant shredding types defined by the [shredding 
spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types).
 
   
   Unsigned integer types are not on that list, so we'll never need to unshred 
them (even tho we can shred them, and can even `variant_get` them by converting 
back).
   
   Complex types like Union and Map are also not on that list and so we'll 
never need to unshred them. But we can still convert them to variant: Whatever 
union branch is active for each row gets converted to variant, which works 
fine; maps are trickier -- I think our current code forcibly converts the map 
key column to string (with a `cast` 🙀) and then converts the result to variant 
object.
   
   FixedLenBinary is also tricky, because it's not a valid shredded variant 
type, but UUID uses `FixedLenBinary(16)` as its physical type. So when 
converting to variant, all binary types (including fixed len) convert to 
`Variant::Binary` _except_ that `FixedLenBinary(16)` with the UUID extension 
type would convert to `Variant::Uuid`.
   
   So one immediate problem is that the two operations have an overlapping but 
not equivalent set of types. And some physical types that seem the same have 
different interpretation/semantics. Even the simplest -- the NULL builder -- 
has different semantics between the two operations.
   
   Another problem is that the definition of "primitive type" differs between 
the two modules:
   * `unshred_variant` takes the variant perspective, so string, binary, and 
boolean arrays can all implement the generic `AppendToVariantBuilder` trait 
that `UnshredPrimitiveRowBuilder` relies on. But timestamps, which need extra 
state (timezone info) are _not_ primitive and need their own builder.
   * `arrow_to_variant` takes the arrow perspective, so string, binary and 
boolean are _not_ primitive types and thus need their own builder 
implementations. Additionally, all decimal and temporal types need special 
treatment and they get customer builders as well.
   
   Another problem is the need to handle `value` column when unshredding, which 
is _not_ needed when converting strongly typed data to variant.
   
   Overall... I couldn't find a way to slice this better, in spite of my 
intuition screaming it should be possible.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to