alamb commented on code in PR #8600:
URL: https://github.com/apache/arrow-rs/pull/8600#discussion_r2449475543
##########
arrow-array/src/builder/generic_bytes_builder.rs:
##########
@@ -348,6 +348,50 @@ impl<O: OffsetSizeTrait> std::fmt::Write for
GenericStringBuilder<O> {
}
}
+/// A byte size value representing the number of bytes to allocate per string
in [`GenericStringBuilder`]
+///
+/// To create a [`GenericStringBuilder`] using `.with_capacity` we are
required to provide: \
+/// - `item_capacity` - the row count \
+/// - `data_capacity` - total string byte count \
+///
+/// We will use the `AVERAGE_STRING_LENGTH` * row_count for `data_capacity`. \
+///
+/// These capacities are preallocation hints used to improve performance,
+/// but consuquences of passing a hint too large or too small should be
negligible.
+const AVERAGE_STRING_LENGTH: usize = 16;
+/// Trait for string-like array builders
+///
+/// This trait provides unified interface for builders that append string-like
data
+/// such as [`GenericStringBuilder<O>`] and
[`crate::builder::StringViewBuilder`]
+pub trait StringLikeArrayBuilder: ArrayBuilder {
+ /// Returns a human-readable type name for the builder.
+ fn type_name() -> &'static str;
+
+ /// Creates a new builder with the given row capacity.
Review Comment:
pedantically, this also allocates 16x the capacity for the variable payload
in StringArray/ LargeStringArray too (it isn't just the row capacity)
##########
parquet-variant-compute/src/variant_to_arrow.rs:
##########
@@ -413,6 +432,13 @@ macro_rules! define_variant_to_primitive_builder {
}
}
+define_variant_to_primitive_builder!(
+ struct VariantToStringArrowBuilder<'a, B: StringLikeArrayBuilder>
+ |capacity| -> B { B::with_capacity(capacity) },
+ |value| value.as_string(),
+ type_name: B::type_name()
+);
Review Comment:
that is pretty nice
##########
arrow-array/src/builder/generic_bytes_builder.rs:
##########
@@ -348,6 +348,40 @@ impl<O: OffsetSizeTrait> std::fmt::Write for
GenericStringBuilder<O> {
}
}
+const AVERAGE_STRING_LENGTH: usize = 16;
+/// Trait for string-like array builders
+///
+/// This trait provides unified interface for builders that append string-like
data
+/// such as [`GenericStringBuilder<O>`] and
[`crate::builder::StringViewBuilder`]
+pub trait StringLikeArrayBuilder: ArrayBuilder {
Review Comment:
FWIW this will go into arrow 57.1.0 (not in 57.0.0, which is due out
tomorrow).
##########
arrow-array/src/builder/generic_bytes_builder.rs:
##########
@@ -348,6 +348,40 @@ impl<O: OffsetSizeTrait> std::fmt::Write for
GenericStringBuilder<O> {
}
}
+const AVERAGE_STRING_LENGTH: usize = 16;
+/// Trait for string-like array builders
+///
+/// This trait provides unified interface for builders that append string-like
data
+/// such as [`GenericStringBuilder<O>`] and
[`crate::builder::StringViewBuilder`]
+pub trait StringLikeArrayBuilder: ArrayBuilder {
Review Comment:
I think it is ok to make it pub -- this seems like a reasonable API to me.
We actually have something like this in DataFusion already so it makes sense
##########
arrow-array/src/builder/generic_bytes_builder.rs:
##########
@@ -348,6 +348,40 @@ impl<O: OffsetSizeTrait> std::fmt::Write for
GenericStringBuilder<O> {
}
}
+const AVERAGE_STRING_LENGTH: usize = 16;
Review Comment:
My only real concern with this hint is that it will work for some users and
not for others (e.g. it will over allocate memory for short strings). Short of
forcing the caller to pass in the variable capacity I can't see any way around
it that doesn't have other tradeoffs
if the need lower level control they can always use the underlying builder,
so I think this default is ok
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]