Re: [PR] Support casting between BinaryView <--> Utf8 and LargeUtf8 [arrow-rs]

via GitHub Fri, 02 Aug 2024 07:59:55 -0700


alamb commented on code in PR #6180:
URL: https://github.com/apache/arrow-rs/pull/6180#discussion_r1701962474



##########
arrow-cast/src/cast/mod.rs:
##########
@@ -2417,6 +2436,37 @@ where
     Ok(Arc::new(byte_array_builder.finish()))
 }
 
+/// Specialized function to cast from one `ByteViewType` array to `ByteArrayType` array.
+/// Equivalent to [`cast_view_to_byte`] but with additional constraint on the `FROM::Native` type.
+fn cast_slice_view_to_byte<FROM, TO>(array: &dyn Array) -> Result<ArrayRef, ArrowError>
+where
+    FROM: ByteViewType,
+    TO: ByteArrayType,
+    FROM::Native: AsRef<[u8]>,
+    str: AsRef<TO::Native>,
+{
+    let data = array.to_data();
+    let view_array = GenericByteViewArray::<FROM>::from(data);
+
+    let len = view_array.len();
+    let bytes = view_array
+        .views()
+        .iter()
+        .map(|v| ByteView::from(*v).length as usize)
+        .sum::<usize>();
+
+    let mut byte_array_builder = GenericByteBuilder::<TO>::with_capacity(len, bytes);
+
+    for val in view_array.iter() {

Review Comment:
   As I understand it, this code effectively performs UTF-8 validation string
   by string, which @XiangpengHao found to be much slower than doing the
   validation all at once. See https://github.com/apache/arrow-rs/issues/5995
   and https://github.com/apache/arrow-rs/pull/6009.
   
   I think we should optimize the conversion (this can be a follow-on PR) so
   that UTF-8 validation is done once on the byte array data rather than
   element by element.
   
   The pattern would likely look something like this:
   1. Create the offsets and a contiguous buffer with the bytes copied in.
   2. If the target is a `StringArray` or `LargeStringArray`, validate that
      the entire buffer is valid UTF-8 in one call.
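
   To make the two steps concrete, here is a minimal std-only sketch. It is
   not arrow-rs code: `concat_and_validate` is a hypothetical helper, and
   plain `Vec`s stand in for the Arrow offset, value and null buffers. One
   assumption beyond the steps above: checking the whole buffer alone would
   accept a multi-byte character split across two values, so the sketch also
   checks that every offset lands on a character boundary.

```rust
/// Hypothetical helper (not part of arrow-rs): copies every value into one
/// contiguous buffer, records i32 offsets and a validity flag per value, and
/// then validates UTF-8 once over the whole buffer.
fn concat_and_validate(
    values: &[Option<&[u8]>],
) -> Result<(Vec<i32>, Vec<u8>, Vec<bool>), String> {
    // Pre-size the buffers so the copy loop does not reallocate.
    let total: usize = values.iter().map(|v| v.map_or(0, |b| b.len())).sum();
    let mut offsets: Vec<i32> = Vec::with_capacity(values.len() + 1);
    let mut buffer: Vec<u8> = Vec::with_capacity(total);
    let mut validity: Vec<bool> = Vec::with_capacity(values.len());

    // Step 1: build the offsets and the contiguous byte buffer.
    offsets.push(0);
    for v in values {
        if let Some(bytes) = v {
            buffer.extend_from_slice(bytes);
        }
        validity.push(v.is_some());
        let end = i32::try_from(buffer.len())
            .map_err(|_| "offset overflows i32".to_string())?;
        offsets.push(end);
    }

    // Step 2: a single UTF-8 validation call over the entire buffer.
    let text = std::str::from_utf8(&buffer)
        .map_err(|e| format!("invalid UTF-8: {}", e))?;

    // Each value must also start and end on a character boundary; otherwise
    // two individually invalid values could concatenate into valid UTF-8.
    for &offset in &offsets {
        if !text.is_char_boundary(offset as usize) {
            return Err(format!("offset {} splits a UTF-8 character", offset));
        }
    }

    Ok((offsets, buffer, validity))
}
```

   For a `LargeStringArray` target the offsets would be `i64` instead of
   `i32`; for binary targets the validation step would be skipped entirely.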
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
