friendlymatthew commented on code in PR #7915:
URL: https://github.com/apache/arrow-rs/pull/7915#discussion_r2203418536
##########
parquet-variant/src/builder.rs:
##########
@@ -350,14 +378,59 @@ impl<S: AsRef<str>> FromIterator<S> for MetadataBuilder {
}
}
-impl<S: AsRef<str>> Extend<S> for MetadataBuilder {
+impl<S: AsRef<str>> Extend<S> for DefaultMetadataBuilder {
fn extend<T: IntoIterator<Item = S>>(&mut self, iter: T) {
for field_name in iter {
self.upsert_field_name(field_name.as_ref());
}
}
}
+/// Read-only metadata builder that validates field names against an existing
metadata dictionary
Review Comment:
Hi, how do we plan on storing `typed_value`s? Do we plan on encoding it as a
`Variant` and storing the binary representation of it? Or since it is supposed
to be strongly typed, we want to map to specific `DataType`s, which would
require something like https://github.com/apache/arrow-rs/pull/7921
I've been playing around with the unshredding logic and the most naive
version would look something like this. I don't mean this to be the actual
implementation, but I am curious how we plan on representing `typed_value`.
<details>
<summary>naive reconstruct logic</summary>
```rs
pub fn reconstruct_variant(
metadata: &[u8],
value: Option<&[u8]>,
typed_value: Option<(&[u8], &[u8])>, // this is itself a variant
) -> Result<(Vec<u8>, Vec<u8>), ArrowError> {
match typed_value {
Some((typed_metadata, typed_value)) => {
let variant = Variant::try_new(typed_metadata, typed_value)?;
match variant {
Variant::Null
| Variant::Int8(_)
| Variant::Int16(_)
| Variant::Int32(_)
| Variant::Int64(_)
| Variant::Date(_)
| Variant::TimestampMicros(_)
| Variant::TimestampNtzMicros(_)
| Variant::Decimal4(_)
| Variant::Decimal8(_)
| Variant::Decimal16(_)
| Variant::Float(_)
| Variant::Double(_)
| Variant::BooleanTrue
| Variant::BooleanFalse
| Variant::Binary(_)
| Variant::String(_)
| Variant::ShortString(_) => Ok((typed_metadata.to_vec(),
typed_value.to_vec())),
Variant::Object(shredded_object) => {
if let Some(value) = value {
let variant = Variant::try_new(metadata, value)?;
let partially_shredded_object =
variant.as_object().ok_or(ArrowError::InvalidArgumentError(
"partially shredded value must be an
object".to_string(),
))?;
let shredded_keys: HashSet<&str> =
HashSet::from_iter(shredded_object.iter().map(|(k, _)| k));
let partially_shredded_keys: HashSet<&str> =
HashSet::from_iter(partially_shredded_object.iter().map(|(k, _)| k));
if
!shredded_keys.is_disjoint(&partially_shredded_keys) {
return Err(ArrowError::InvalidArgumentError(
"object keys must be disjoint".to_string(),
));
}
// union the two objects together
let mut variant_builder = VariantBuilder::new();
let mut object_builder =
variant_builder.new_object();
for (f, v) in shredded_object.iter() {
object_builder.insert(f, v);
}
for (f, v) in partially_shredded_object.iter() {
object_builder.insert(f, v);
}
object_builder.finish()?;
return Ok(variant_builder.finish());
}
Ok((typed_metadata.to_vec(), typed_value.to_vec()))
}
Variant::List(_) => {
if value.is_some() {
return Err(ArrowError::InvalidArgumentError(
"shredded array must not conflict with variant
value".to_string(),
));
}
return Ok((typed_metadata.to_vec(),
typed_value.to_vec()));
}
}
}
None => match value {
Some(value) => Ok((metadata.to_vec(), value.to_vec())),
None => Err(ArrowError::InvalidArgumentError(
"No value or typed value provided".to_string(),
)),
},
}
}
```
</details>
If we strongly type `typed_value` to whatever the variant type is, the
primitive variants work nicely, since we'd be storing a `PrimitiveArray<T>` and
we can easily index into it.
For complex variants like objects/lists, I'm having a hard time figuring out
how to strongly type these values. Maybe for these cases we should just store
the binary encoded version of a Variant? (metadata, value).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]