alamb commented on code in PR #5871:
URL: https://github.com/apache/arrow-rs/pull/5871#discussion_r1635097217
##########
arrow-cast/src/cast/dictionary.rs:
##########
@@ -85,10 +85,97 @@ pub(crate) fn dictionary_cast<K: ArrowDictionaryKeyType>(
Ok(new_array)
}
+ Utf8View => {
+ // `unpack_dictionary` can handle Utf8View/BinaryView types, but
incurs unnecessary data copy of the value buffer.
Review Comment:
💯 for comments
##########
arrow-cast/src/cast/dictionary.rs:
##########
@@ -85,10 +85,97 @@ pub(crate) fn dictionary_cast<K: ArrowDictionaryKeyType>(
Ok(new_array)
}
+ Utf8View => {
+ // `unpack_dictionary` can handle Utf8View/BinaryView types, but
incurs unnecessary data copy of the value buffer.
+ // we handle it here to avoid the copy.
+ let dict_array = array
+ .as_dictionary::<K>()
+ .downcast_dict::<StringArray>()
+ .unwrap();
+
+ let string_values = dict_array.values();
+ let value_offsets = string_values.value_offsets();
+ let value_buffer = string_values.values().clone();
+
+ let view_buffer =
+ view_from_dict_values(value_offsets, &value_buffer,
dict_array.keys());
+
+ // Safety:
+ // the buffer is from StringArray which is utf8.
+ let string_view = unsafe {
+ StringViewArray::new_unchecked(
+ view_buffer,
+ vec![value_buffer],
+ dict_array.nulls().cloned(),
Review Comment:
I think calling `nulls()` doesn't handle the case where the dictionary value
itself (rather than the key) was null
https://github.com/apache/arrow-rs/blob/cf59b6cd826412635dc391d4cf0f9d8310f5a226/arrow-array/src/array/dictionary_array.rs#L727
I think this should call `logical_nulls()` instead:
https://github.com/apache/arrow-rs/blob/cf59b6cd826412635dc391d4cf0f9d8310f5a226/arrow-array/src/array/dictionary_array.rs#L731
Also, it would be good to create a test case that covers this too
##########
arrow-cast/src/cast/mod.rs:
##########
@@ -5257,10 +5249,30 @@ mod tests {
let binary_view_array = cast(&binary_array,
&DataType::BinaryView).unwrap();
assert_eq!(binary_view_array.data_type(), &DataType::BinaryView);
- let expect_binary_view_array = BinaryViewArray::from(data);
+ let expect_binary_view_array =
BinaryViewArray::from_iter(VIEW_TEST_DATA);
assert_eq!(binary_view_array.as_ref(), &expect_binary_view_array);
}
+ #[test]
+ fn test_dict_to_view() {
+ let string_view_array = StringViewArray::from_iter(VIEW_TEST_DATA);
+ let string_dict_array: DictionaryArray<Int8Type> =
VIEW_TEST_DATA.into_iter().collect();
Review Comment:
Can you please update this test to have a dictionary array that has:
1. Repeated use of dictionary values
2. keys that are not all increasing
3. Nulls in the values (as well as the keys)
Perhaps what you can do is create a `StringArray` from `VIEW_TEST_DATA` and
then make create the indexes manually
Like this example
https://docs.rs/arrow/latest/arrow/array/struct.DictionaryArray.html#example-from-existing-arrays
Does that make sense?
##########
arrow-cast/src/cast/dictionary.rs:
##########
@@ -85,10 +85,97 @@ pub(crate) fn dictionary_cast<K: ArrowDictionaryKeyType>(
Ok(new_array)
}
+ Utf8View => {
+ // `unpack_dictionary` can handle Utf8View/BinaryView types, but
incurs unnecessary data copy of the value buffer.
+ // we handle it here to avoid the copy.
+ let dict_array = array
+ .as_dictionary::<K>()
+ .downcast_dict::<StringArray>()
+ .unwrap();
+
+ let string_values = dict_array.values();
+ let value_offsets = string_values.value_offsets();
+ let value_buffer = string_values.values().clone();
+
+ let view_buffer =
+ view_from_dict_values(value_offsets, &value_buffer,
dict_array.keys());
+
+ // Safety:
+ // the buffer is from StringArray which is utf8.
+ let string_view = unsafe {
+ StringViewArray::new_unchecked(
+ view_buffer,
+ vec![value_buffer],
+ dict_array.nulls().cloned(),
+ )
+ };
+ Ok(Arc::new(string_view))
+ }
+ BinaryView => {
+ // `unpack_dictionary` can handle Utf8View/BinaryView types, but
incurs unnecessary data copy of the value buffer.
+ // we handle it here to avoid the copy.
+ let dict_array = array
+ .as_dictionary::<K>()
+ .downcast_dict::<BinaryArray>()
+ .unwrap();
+
+ let binary_values = dict_array.values();
+ let value_offsets = binary_values.value_offsets();
+ let value_buffer = binary_values.values().clone();
+
+ let view_buffer =
+ view_from_dict_values(value_offsets, &value_buffer,
dict_array.keys());
+ let binary_view = unsafe {
+ BinaryViewArray::new_unchecked(
+ view_buffer,
+ vec![value_buffer],
+ dict_array.nulls().cloned(),
+ )
+ };
+ Ok(Arc::new(binary_view))
+ }
_ => unpack_dictionary::<K>(array, to_type, cast_options),
}
}
+fn view_from_dict_values<K: ArrowDictionaryKeyType>(
+ value_offsets: &[i32],
+ value_buffer: &arrow_buffer::Buffer,
+ keys: &PrimitiveArray<K>,
+) -> ScalarBuffer<u128> {
+ let mut view_builder = BufferBuilder::<u128>::new(keys.len());
+ for i in keys.iter() {
+ match i {
+ Some(v) => {
+ let idx = v.to_usize().unwrap();
+ let offset = value_offsets[idx];
+ let end = value_offsets[idx + 1];
+ let length = end - offset;
+ let value_buf = &value_buffer[offset as usize..end as usize];
+
+ if length <= 12 {
Review Comment:
Instead of creating the views directly, what do you think about using
`try_append_view` and `try_append_block` added in this PR:
https://github.com/apache/arrow-rs/pull/5796
Maybe as bonus points you could add a benchmark for this casting operation,
and see if adding `append_view_unchecked` would make any difference
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]