alamb commented on code in PR #6180:
URL: https://github.com/apache/arrow-rs/pull/6180#discussion_r1701962474
##########
arrow-cast/src/cast/mod.rs:
##########
@@ -2417,6 +2436,37 @@ where
Ok(Arc::new(byte_array_builder.finish()))
}
+/// Specialized function to cast from one `ByteViewType` array to `ByteArrayType` array.
+/// Equvilent to [`cast_view_to_byte`] but with additional constraint on the `FROM::Native` type.
+fn cast_slice_view_to_byte<FROM, TO>(array: &dyn Array) -> Result<ArrayRef, ArrowError>
+where
+ FROM: ByteViewType,
+ TO: ByteArrayType,
+ FROM::Native: AsRef<[u8]>,
+ str: AsRef<TO::Native>,
+{
+ let data = array.to_data();
+ let view_array = GenericByteViewArray::<FROM>::from(data);
+
+ let len = view_array.len();
+ let bytes = view_array
+ .views()
+ .iter()
+ .map(|v| ByteView::from(*v).length as usize)
+ .sum::<usize>();
+
+ let mut byte_array_builder = GenericByteBuilder::<TO>::with_capacity(len, bytes);
+
+ for val in view_array.iter() {
Review Comment:
From what I understand, this code will effectively do UTF-8 validation string
by string, which @XiangpengHao found to be much slower than doing the validation
all at once -- see https://github.com/apache/arrow-rs/issues/5995 and
https://github.com/apache/arrow-rs/pull/6009.
I think we should optimize the conversion (can be a follow-on PR) so that
the UTF-8 validation is done once on the byte array data rather than element by
element.
The pattern would likely look something like:
1. Create the offsets and a contiguous buffer with the bytes copied in
2. If the target is a `StringArray` or `LargeStringArray`, validate that the
entire buffer is valid UTF-8 in one call
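
To illustrate the idea, here is a rough std-only sketch of that pattern. It is not arrow-rs code: `concat_and_validate` is a hypothetical name, plain byte slices stand in for the values of the view array, and `i32` offsets are assumed (as for `StringArray`). One thing to watch out for: validating the whole buffer is not sufficient on its own, because a code point split across two values would still pass, so the sketch also checks that every offset falls on a char boundary.

```rust
use std::str;

/// Hypothetical helper (not part of arrow-rs): copies every value into one
/// contiguous buffer, records the offsets, and validates UTF-8 once over the
/// whole buffer instead of once per value.
fn concat_and_validate(values: &[&[u8]]) -> Result<(Vec<i32>, Vec<u8>), String> {
    // Step 1: build the offsets and the contiguous data buffer.
    let total: usize = values.iter().map(|v| v.len()).sum();
    let mut offsets: Vec<i32> = Vec::with_capacity(values.len() + 1);
    let mut buffer: Vec<u8> = Vec::with_capacity(total);
    offsets.push(0);
    for v in values {
        buffer.extend_from_slice(v);
        let end = i32::try_from(buffer.len())
            .map_err(|_| "offset overflow for i32 offsets".to_string())?;
        offsets.push(end);
    }

    // Step 2: a single UTF-8 validation call over the entire buffer.
    let s = str::from_utf8(&buffer).map_err(|e| e.to_string())?;

    // Each offset must also sit on a char boundary; otherwise a value would
    // start or end in the middle of a multi-byte code point.
    for &o in &offsets {
        if !s.is_char_boundary(o as usize) {
            return Err(format!("offset {o} splits a UTF-8 code point"));
        }
    }
    Ok((offsets, buffer))
}
```

For a binary target (no UTF-8 requirement) step 2 and the boundary check would simply be skipped.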