Jefffrey commented on code in PR #19441:
URL: https://github.com/apache/datafusion/pull/19441#discussion_r2637876566
##########
datafusion/common/src/scalar/mod.rs:
##########
@@ -9232,6 +9245,27 @@ mod tests {
}
}
+ #[test]
+ fn test_views_minimize_memory() {
+ let value = "this string is longer than 12 bytes".to_string();
+
+ let scalar = ScalarValue::Utf8View(Some(value.clone()));
+ let array = scalar.to_array_of_size(10).unwrap();
+ let array = array.as_string_view();
+ let buffers = array.data_buffers();
+ assert_eq!(1, buffers.len());
+ // Ensure we only have a single copy of the value string
+ assert_eq!(value.len(), buffers[0].len());
Review Comment:
On main this assert fails as the data buffer would be `10 * value.len()`;
with this fix we only have a single copy of the full string in the child
buffer, minimizing memory footprint of the output array
##########
datafusion/common/src/scalar/mod.rs:
##########
@@ -3027,7 +3028,13 @@ impl ScalarValue {
},
ScalarValue::Utf8View(e) => match e {
Some(value) => {
- Arc::new(StringViewArray::from_iter_values(repeat_n(value,
size)))
+ let mut builder =
+
StringViewBuilder::with_capacity(size).with_deduplicate_strings();
+ for _ in 0..size {
+ builder.append_value(value);
+ }
+ let array = builder.finish();
+ Arc::new(array)
Review Comment:
Technically we could further optimize this by manually calling
`StringViewArray::try_new` with correct data and avoid need to hash as part of
the builder; felt that might get too into the weeds of arrow-rs code, so stuck
with this simpler approach
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]