scovich commented on code in PR #8521:
URL: https://github.com/apache/arrow-rs/pull/8521#discussion_r2396006429
##########
parquet-variant-compute/src/arrow_to_variant.rs:
##########
@@ -38,6 +38,98 @@ use parquet_variant::{
use std::collections::HashMap;
use std::ops::Range;
+// ============================================================================
+// Shared traits and helpers for Arrow-to-Variant conversion
+// ============================================================================
+
+/// Zero-cost trait for converting Arrow array values to Variant
+pub(crate) trait ArrowToVariant: Array {
+    fn append_to_variant_builder(
+        &self,
+        builder: &mut impl VariantBuilderExt,
+        index: usize,
+    ) -> Result<(), ArrowError>;
+}
+
+/// Macro to define ArrowToVariant implementations with optional value transformation
+macro_rules! define_arrow_to_variant {
Review Comment:
> There seems to be two major pieces of functionality
>
> 1. Converting elements of Arrow Arrays to the corresponding Variant elements
>
> 2. Converting Variant arrays back to Arrow Arrays
>
I agree this is the intuitive view, and `arrow_to_variant` and `unshred_variant` both fall loosely into category 1.
> It seems to me like `arrow_to_variant.rs` and `unshred_variant.rs` still have non trivial overlap as you mention. What I was hoping was that we could somehow use one set of traits / structs for both those operations which would mean we would get `unshred` support "for free" for other types like `UnionArray`s
>
> That was the dream at anyways.
So this is tricky -- shredded variant `typed_value` columns must be one of
the supported variant shredding types defined by the [shredding
spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types).
Unsigned integer types are not on that list, so we'll never need to unshred them (even though we can shred them, and can even `variant_get` them by converting back).
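To illustrate (a sketch, not code from this PR): a type that can never appear as a shredded `typed_value` can still implement the conversion trait above. This assumes `VariantBuilderExt` exposes an `append_value` that accepts anything convertible into a `Variant`, and it elides null handling:

```rust
use arrow_array::UInt32Array;
use arrow_schema::ArrowError;
use parquet_variant::VariantBuilderExt;

// Sketch only: unsigned ints are convertible to variant even though they are
// never a valid shredded `typed_value` type. Widening u32 -> i64 is lossless.
impl ArrowToVariant for UInt32Array {
    fn append_to_variant_builder(
        &self,
        builder: &mut impl VariantBuilderExt,
        index: usize,
    ) -> Result<(), ArrowError> {
        // Null handling elided; assumes `append_value` takes `impl Into<Variant>`
        // (an assumption about the builder API, not a quote of it).
        builder.append_value(self.value(index) as i64);
        Ok(())
    }
}
```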
Complex types like Union and Map are also not on that list, so we'll never need to unshred them either. But we can still convert them to variant: whatever union branch is active for each row gets converted to variant, which works fine. Maps are trickier -- I think our current code forcibly converts the map key column to string (with a `cast` 🙀) and then converts the result to a variant object.
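To make the map case concrete, the key-casting step looks roughly like this (sketch only; `arrow_cast::cast` is the real kernel, but the wiring here is illustrative and the object-building half is omitted):

```rust
use arrow_array::{ArrayRef, MapArray};
use arrow_cast::cast;
use arrow_schema::{ArrowError, DataType};

/// Force the map's key child array to Utf8 so every key can serve as a variant
/// object field name, whatever the key's original arrow type was.
fn map_keys_as_strings(map: &MapArray) -> Result<ArrayRef, ArrowError> {
    cast(map.keys(), &DataType::Utf8)
}
```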
FixedLenBinary is also tricky, because it's not a valid shredded variant
type, but UUID uses `FixedLenBinary(16)` as its physical type. So when
converting to variant, all binary types (including fixed len) convert to
`Variant::Binary` _except_ that `FixedLenBinary(16)` with the UUID extension
type would convert to `Variant::Uuid`.
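Sketching that dispatch (again illustrative, not this PR's code; it assumes the canonical `arrow.uuid` extension type is detected via the standard `ARROW:extension:name` field metadata key):

```rust
use arrow_array::FixedSizeBinaryArray;
use arrow_schema::Field;

/// Does this field carry the canonical `arrow.uuid` extension type?
fn is_uuid_extension(field: &Field) -> bool {
    field
        .metadata()
        .get("ARROW:extension:name")
        .is_some_and(|name| name == "arrow.uuid")
}

/// Decide which variant a fixed-size binary value becomes. Returning the variant
/// *name* keeps the sketch independent of the actual `Variant` constructors.
fn classify(field: &Field, array: &FixedSizeBinaryArray) -> &'static str {
    if array.value_length() == 16 && is_uuid_extension(field) {
        "Variant::Uuid" // FixedLenBinary(16) tagged as UUID
    } else {
        "Variant::Binary" // everything else stays opaque bytes
    }
}
```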
So one immediate problem is that the two operations have an overlapping but
not equivalent set of types. And some physical types that seem the same have
different interpretation/semantics. Even the simplest -- the NULL builder --
has different semantics between the two operations.
Another problem is that the definition of "primitive type" differs between
the two modules:
* `unshred_variant` takes the variant perspective, so string, binary, and boolean arrays can all implement the generic `AppendToVariantBuilder` trait that `UnshredPrimitiveRowBuilder` relies on. But timestamps, which need extra state (timezone info), are _not_ primitive and need their own builder.
* `arrow_to_variant` takes the arrow perspective, so string, binary, and boolean are _not_ primitive types and thus need their own builder implementations. Additionally, all decimal and temporal types need special treatment and get custom builders as well.
Another problem is the need to handle the `value` column when unshredding, which is _not_ needed when converting strongly typed data to variant.
Overall... I couldn't find a way to slice this better, in spite of my
intuition screaming it should be possible.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]