Jefffrey commented on code in PR #8912:
URL: https://github.com/apache/arrow-rs/pull/8912#discussion_r2553988287
##########
arrow-cast/src/cast/dictionary.rs:
##########
@@ -28,111 +28,92 @@ pub(crate) fn dictionary_cast<K: ArrowDictionaryKeyType>(
) -> Result<ArrayRef, ArrowError> {
use DataType::*;
- match to_type {
- Dictionary(to_index_type, to_value_type) => {
- let dict_array = array
Review Comment:
I moved this inner logic into a separate function to be similar to how the
other arms delegate to functions for the full logic.
##########
arrow-cast/src/cast/dictionary.rs:
##########
@@ -157,21 +138,17 @@ fn view_from_dict_values<K: ArrowDictionaryKeyType, T:
ByteViewType, V: ByteArra
}
}
}
- Ok(builder.finish())
+ Ok(Arc::new(builder.finish()))
}
-// Unpack a dictionary where the keys are of type <K> into a flattened array
of type to_type
-pub(crate) fn unpack_dictionary<K>(
- array: &dyn Array,
+// Unpack a dictionary into a flattened array of type to_type
+pub(crate) fn unpack_dictionary<K: ArrowDictionaryKeyType>(
+ array: &DictionaryArray<K>,
Review Comment:
Moved the cast to dictionary array to the parent function
(`dictionary_cast()`) so we don't need to always cast in each of the
specialized functions.
##########
arrow-cast/src/cast/dictionary.rs:
##########
@@ -28,111 +28,92 @@ pub(crate) fn dictionary_cast<K: ArrowDictionaryKeyType>(
) -> Result<ArrayRef, ArrowError> {
use DataType::*;
- match to_type {
- Dictionary(to_index_type, to_value_type) => {
- let dict_array = array
- .as_any()
- .downcast_ref::<DictionaryArray<K>>()
- .ok_or_else(|| {
- ArrowError::ComputeError(
- "Internal Error: Cannot cast dictionary to
DictionaryArray of expected type".to_string(),
- )
- })?;
+ let array = array.as_dictionary::<K>();
+ let from_child_type = array.values().data_type();
+ match (from_child_type, to_type) {
+ (_, Dictionary(to_index_type, to_value_type)) => {
+ dictionary_to_dictionary_cast(array, to_index_type, to_value_type,
cast_options)
+ }
+ // `unpack_dictionary` can handle Utf8View/BinaryView types, but
incurs unnecessary data
+ // copy of the value buffer. Fast path which avoids copying underlying
values buffer.
+ // TODO: handle LargeUtf8/LargeBinary -> View (need to check offsets
can fit)
+ // TODO: handle cross types (String -> BinaryView, Binary ->
StringView)
+ // (need to validate utf8?)
+ (Utf8, Utf8View) => view_from_dict_values::<K, Utf8Type,
StringViewType>(
+ array.keys(),
+ array.values().as_string::<i32>(),
+ ),
+ (Binary, BinaryView) => view_from_dict_values::<K, BinaryType,
BinaryViewType>(
+ array.keys(),
+ array.values().as_binary::<i32>(),
+ ),
Review Comment:
This is the main change; we check the values beforehand to see if it is
valid for the fast path, which was missing before (and the assumption lead to
an error). I've intentionally kept the behaviour as intended (limited the fast
path only for `Dictionary<Utf8>` -> `Utf8View`, `Dictionary<Binary>` ->
`BinaryView`) since was mainly interested in just fixing the cast bug. Left
comments for potentially extending this fast path to be valid for more
combinations (can raise issues for this).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]