scovich commented on code in PR #8600:
URL: https://github.com/apache/arrow-rs/pull/8600#discussion_r2429852073
##########
parquet-variant-compute/src/variant_to_arrow.rs:
##########
@@ -52,6 +53,12 @@ pub(crate) enum PrimitiveVariantToArrowRowBuilder<'a> {
     TimestampNano(VariantToTimestampArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
     TimestampNanoNtz(VariantToTimestampNtzArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
     Date(VariantToPrimitiveArrowRowBuilder<'a, datatypes::Date32Type>),
+    StringView(VariantToUtf8ViewArrowBuilder<'a>),
Review Comment:
I don't see any meaningful call sites that pass a data capacity -- only some
unit tests.
Ultimately, `variant_get` will call `make_variant_to_arrow_row_builder`, and
I don't think that code has any way to predict what the correct data capacity
might be? How could one even define "correct" when a single value would be
applied to each of potentially many string row builders that will be created,
when each of those builders could see a completely different distribution of
string sizes and null values?
This is very different from the row capacity value, which _IS_ precisely
known and applies equally to all builders `variant_get` might need to create.
Also -- these capacities are just pre-allocation hints; passing too large a
hint temporarily wastes a bit of memory, and passing too small a hint just
means one or more internal reallocations.
I would vote to just choose a reasonable default "average string size" and
multiply that by the row count to obtain a data capacity hint when needed.
TBD whether that average string size should be a parameter that originates
with the caller of `variant_get` and gets plumbed all the way through -- but
that seems like a really invasive API change for very little benefit. Seems
like a simple `const` would be much better.
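A minimal sketch of that simple-const approach (the name and value here are hypothetical, not from this PR):
```rust
/// Hypothetical default; any reasonable guess works, since this is only a
/// pre-allocation hint (too big wastes a little memory, too small just
/// causes internal reallocations).
const AVERAGE_STRING_LENGTH: usize = 16;

/// Derive a data-capacity hint from the precisely-known row capacity.
fn string_data_capacity_hint(row_capacity: usize) -> usize {
    row_capacity.saturating_mul(AVERAGE_STRING_LENGTH)
}
```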
##########
parquet-variant-compute/src/variant_to_arrow.rs:
##########
@@ -133,6 +142,33 @@ impl<'a> PrimitiveVariantToArrowRowBuilder<'a> {
             TimestampNano(b) => b.finish(),
             TimestampNanoNtz(b) => b.finish(),
             Date(b) => b.finish(),
+            StringView(b) => b.finish(),
+        }
+    }
+}
+
+impl<'a> StringVariantToArrowRowBuilder<'a> {
+    pub fn append_null(&mut self) -> Result<()> {
+        use StringVariantToArrowRowBuilder::*;
+        match self {
+            Utf8(b) => b.append_null(),
+            LargeUtf8(b) => b.append_null(),
+        }
+    }
Review Comment:
I don't see any string-specific logic that would merit a nested enum like
this?
Can we make this builder generic and use it in two new variants of the
top-level enum?
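For illustration, a minimal sketch of the generic approach (names hypothetical, lifetime elided for brevity), assuming arrow's `GenericStringBuilder`:
```rust
use arrow::array::{GenericStringBuilder, OffsetSizeTrait};

/// Hypothetical generic builder: `O = i32` covers Utf8 and `O = i64` covers
/// LargeUtf8, so both nested enum variants collapse into one type.
struct VariantToStringArrowRowBuilder<O: OffsetSizeTrait> {
    builder: GenericStringBuilder<O>,
}

impl<O: OffsetSizeTrait> VariantToStringArrowRowBuilder<O> {
    fn append_null(&mut self) {
        // One implementation; no match over string-specific variants needed.
        self.builder.append_null();
    }
}
```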
##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -763,6 +764,13 @@ mod test {
         BooleanArray::from(vec![Some(true), Some(false), Some(true)])
     );
+    perfectly_shredded_to_arrow_primitive_test!(
+        get_variant_perfectly_shredded_utf8_as_utf8,
+        DataType::Utf8,
Review Comment:
That would be from the `VariantArray` constructor, which invokes this code:
```rust
fn canonicalize_and_verify_data_type(
    data_type: &DataType,
) -> Result<Cow<'_, DataType>, ArrowError> {
    ...
    let new_data_type = match data_type {
        ...
        // We can _possibly_ allow (some of) these some day?
        LargeBinary | LargeUtf8 | Utf8View | ListView(_) | LargeList(_) | LargeListView(_) => {
            fail!()
        }
```
I originally added that code because I was not confident I knew what the
correct behavior should be. The [shredding
spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types)
says:
> Shredded values must use the following Parquet types:
>
> | Variant Type | Parquet Physical Type   | Parquet Logical Type |
> |--------------|-------------------------|----------------------|
> | ...          |                         |                      |
> | binary       | BINARY                  |                      |
> | string       | BINARY                  | STRING               |
> | ...          |                         |                      |
> | array        | GROUP; see Arrays below | LIST                 |
But I'm _pretty_ sure that doesn't need to constrain the use of
`DataType::Utf8` vs. `DataType::LargeUtf8` vs. `DataType::Utf8View`? (Similar
story for the various in-memory layouts of lists and binary values.)
A similar dilemma is that the `metadata` column is supposed to be parquet
BINARY type, but arrow-parquet produces `BinaryViewArray` by default. Right now
we replace `DataType::Binary` with `DataType::BinaryView` and force a cast as
needed.
If we think the shredding spec forbids `LargeUtf8` or `Utf8View` then we
probably need to cast binary views back to normal binary as well.
If we don't think the shredding spec forbids those types, then we should
probably support `metadata: LargeBinaryArray` (tho the narrowing cast to
`BinaryArray` might fail if the offsets really don't fit in 32 bits).
@alamb @cashmand, any advice here?
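For reference, both directions go through the standard `cast` kernel; a quick sketch of the Binary/BinaryView round trip being discussed:
```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, BinaryArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    let metadata: ArrayRef = Arc::new(BinaryArray::from_vec(vec![b"m1".as_ref(), b"m2"]));
    // Binary -> BinaryView always succeeds.
    let as_view = cast(&metadata, &DataType::BinaryView)?;
    // Casting back is also supported; the narrowing LargeBinary -> Binary
    // case is where an offset-overflow failure could occur.
    let round_trip = cast(&as_view, &DataType::Binary)?;
    assert_eq!(round_trip.data_type(), &DataType::Binary);
    assert_eq!(round_trip.len(), metadata.len());
    Ok(())
}
```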
##########
parquet-variant-compute/src/variant_to_arrow.rs:
##########
@@ -401,6 +481,27 @@ macro_rules! define_variant_to_primitive_builder {
     }
 }
+define_variant_to_primitive_builder!(
+    struct VariantToUtf8ArrowRowBuilder<'a>
+    |item_capacity, data_capacity: usize| -> StringBuilder { StringBuilder::with_capacity(item_capacity, data_capacity) },
+    |value| value.as_string(),
+    type_name: "Utf8"
+);
+
+define_variant_to_primitive_builder!(
+    struct VariantToLargeUtf8ArrowBuilder<'a>
+    |item_capacity, data_capacity: usize| -> LargeStringBuilder { LargeStringBuilder::with_capacity(item_capacity, data_capacity) },
+    |value| value.as_string(),
+    type_name: "LargeUtf8"
+);
+
+define_variant_to_primitive_builder!(
+    struct VariantToUtf8ViewArrowBuilder<'a>
+    |capacity| -> StringViewBuilder { StringViewBuilder::with_capacity(capacity) },
+    |value| value.as_string(),
+    type_name: "Utf8View"
+);
Review Comment:
Check out the `ListLikeArray` trait in arrow_to_variant.rs -- I suspect a
`StringLikeArrayBuilder` trait would be very helpful here, because the "shape"
of all the string arrays is very similar (even tho the specific array types
and possibly some method names might differ).
```suggestion
define_variant_to_primitive_builder!(
    struct VariantToStringArrowBuilder<'a, B: StringLikeArrayBuilder>
    |capacity| -> B { B::with_capacity(capacity) },
    |value| value.as_string(),
    type_name: B::type_name()
);
```
where
```rust
trait StringLikeArrayBuilder: ArrayBuilder {
    fn type_name() -> &'static str;
    fn with_capacity(capacity: usize) -> Self;
    fn append_value(&mut self, value: &str);
    fn append_null(&mut self);
}

impl StringLikeArrayBuilder for StringViewBuilder {
    ...
}

impl<O: OffsetSizeTrait> StringLikeArrayBuilder for GenericStringBuilder<O> {
    ...
    fn with_capacity(capacity: usize) -> Self {
        Self::with_capacity(capacity, capacity * AVERAGE_STRING_LENGTH)
    }
    ...
}
```
As noted in a different comment, we don't have any meaningful way to predict
the needed data_capacity of a string builder, so IMO `<GenericStringBuilder as
StringLikeArrayBuilder>::with_capacity` should just scale the item capacity by
some reasonable guess at the average string size, and call that the data
capacity.
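To make that concrete, here is one hedged way those impls could be filled in (assuming the trait above; `AVERAGE_STRING_LENGTH` is the hypothetical const from the earlier comment, and the delegating calls resolve to the builders' inherent methods, which take precedence over the trait methods of the same name):
```rust
use arrow::array::{GenericStringBuilder, OffsetSizeTrait, StringViewBuilder};

const AVERAGE_STRING_LENGTH: usize = 16; // hypothetical default

impl StringLikeArrayBuilder for StringViewBuilder {
    fn type_name() -> &'static str {
        "Utf8View"
    }
    fn with_capacity(capacity: usize) -> Self {
        // View builders manage their data buffers internally, so the item
        // capacity alone is a sufficient hint.
        StringViewBuilder::with_capacity(capacity)
    }
    fn append_value(&mut self, value: &str) {
        // Resolves to the inherent method, not this trait method.
        self.append_value(value);
    }
    fn append_null(&mut self) {
        self.append_null();
    }
}

impl<O: OffsetSizeTrait> StringLikeArrayBuilder for GenericStringBuilder<O> {
    fn type_name() -> &'static str {
        if O::IS_LARGE { "LargeUtf8" } else { "Utf8" }
    }
    fn with_capacity(capacity: usize) -> Self {
        // No way to know the real data capacity, so scale by the assumed
        // average string size.
        GenericStringBuilder::with_capacity(capacity, capacity * AVERAGE_STRING_LENGTH)
    }
    fn append_value(&mut self, value: &str) {
        self.append_value(value);
    }
    fn append_null(&mut self) {
        self.append_null();
    }
}
```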
##########
parquet-variant-compute/src/variant_to_arrow.rs:
##########
@@ -52,6 +53,12 @@ pub(crate) enum PrimitiveVariantToArrowRowBuilder<'a> {
     TimestampNano(VariantToTimestampArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
     TimestampNanoNtz(VariantToTimestampNtzArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
     Date(VariantToPrimitiveArrowRowBuilder<'a, datatypes::Date32Type>),
+    StringView(VariantToUtf8ViewArrowBuilder<'a>),
Review Comment:
A big benefit of the simpler approach to data capacity: All the string
builders are, in fact, primitive builders (see the macro invocations below) --
so we can just add three new enum variants to the primitive row builder enum
and call it done.
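i.e. something like this (variant names hypothetical, using the generic builder sketched in the other comment):
```rust
pub(crate) enum PrimitiveVariantToArrowRowBuilder<'a> {
    // ...existing variants unchanged...
    Utf8(VariantToStringArrowBuilder<'a, StringBuilder>),
    LargeUtf8(VariantToStringArrowBuilder<'a, LargeStringBuilder>),
    Utf8View(VariantToStringArrowBuilder<'a, StringViewBuilder>),
}
```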
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]