alamb commented on a change in pull request #1048:
URL: https://github.com/apache/arrow-rs/pull/1048#discussion_r772629985



##########
File path: arrow/src/array/ord.rs
##########
@@ -303,6 +362,23 @@ pub mod tests {
         Ok(())
     }
 
+    #[test]
+    fn test_dict_keys() -> Result<()> {
+        let data = vec!["a", "b", "c", "a", "a", "c", "c"];
+        let array = data.into_iter().collect::<DictionaryArray<Int16Type>>();

Review comment:
       Are we sure that this line produces a `DictionaryArray` with a sorted 
dictionary? I think it uses a `StringDictionaryBuilder` which uses a `HashMap` 
for key / values (and thus I don't think gives any guarantee about the 
ordering). 

##########
File path: arrow/src/compute/kernels/sort.rs
##########
@@ -403,6 +439,10 @@ pub struct SortOptions {
     pub descending: bool,
     /// Whether to sort nulls first
     pub nulls_first: bool,
+    /// Whether dictionary arrays can be sorted by their keys instead of 
values.

Review comment:
       There are already some hooks in the arrow codebase for `is_ordered` -- 
like `DictionaryArray::is_ordered`
   
   What do you think about using those hooks rather than a new 
`assume_sorted_dictionaries` option on `SortOptions` -- that would make it 
harder to pick the wrong option
   
   
https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/apache/arrow-rs%24+is_ordered&patternType=literal
   
   Perhaps we could add a function like
   
   ```rust
   impl DictionaryArray {
   
   fn sorted(self) -> Self { 
     // check if dictionary is already sorted, 
     // otherwise sort it
     Self { 
       is_ordered: true
       self,
     }
   }
   ``` 
   
   With an unsafe variant to skip the validation

##########
File path: arrow/src/compute/kernels/sort.rs
##########
@@ -526,12 +567,35 @@ where
             .map(|index| (index, values.value(index as usize)))
             .collect::<Vec<(u32, T::Native)>>()
     };
-    sort_primitive_inner(values, null_indices, cmp, options, limit, valids)
+    sort_primitive_inner(null_indices, cmp, options, limit, valids)
+}
+
+/// Sort a dictionary array by its keys

Review comment:
       ```suggestion
   /// Sort a dictionary array by the numeric value of its keys. 
   /// If the dictionary is sorted, this will produce the same result 
   /// as sorting by the unpacked values. 
   ```

##########
File path: arrow/src/compute/kernels/sort.rs
##########
@@ -403,6 +439,10 @@ pub struct SortOptions {
     pub descending: bool,
     /// Whether to sort nulls first
     pub nulls_first: bool,
+    /// Whether dictionary arrays can be sorted by their keys instead of 
values.

Review comment:
       Avoiding a new field in `SortOptions` would also likely reduce the size 
of this PR in terms of number of lines changed as well as keep the change API 
compatible. 

##########
File path: arrow/src/array/ord.rs
##########
@@ -119,9 +138,32 @@ where
 /// # Ok(())
 /// # }
 /// ```
+/// The returned comparator should only be called for non-null elements as the 
result is undefined otherwise.
+/// The caller should check the validity of both indices and then decide 
whether NULLs should come first or last.
 // This is a factory of comparisons.
-// The lifetime 'a enforces that we cannot use the closure beyond any of the 
array's lifetime.

Review comment:
       ๐Ÿ‘ ๐Ÿงน 

##########
File path: arrow/src/array/ord.rs
##########
@@ -119,9 +138,32 @@ where
 /// # Ok(())
 /// # }
 /// ```
+/// The returned comparator should only be called for non-null elements as the 
result is undefined otherwise.
+/// The caller should check the validity of both indices and then decide 
whether NULLs should come first or last.
 // This is a factory of comparisons.
-// The lifetime 'a enforces that we cannot use the closure beyond any of the 
array's lifetime.
 pub fn build_compare(left: &dyn Array, right: &dyn Array) -> 
Result<DynComparator> {
+    build_compare_with_options(left, right, &SortOptions::default())
+}
+
+/// returns a comparison function that compares two values at two different 
positions
+/// between the two arrays.
+/// The arrays' types must be equal.
+///
+/// Only the `assume_sorted_dictionaries` flag of [`SortOptions`] is currently 
used.
+/// If this flag is set and the arrays are dictionary encoded the comparator 
will use the keys
+/// instead of values for comparing. This is more efficient if the 
dictionaries are sorted or
+/// if the sorting is done for partitioning purposes. Setting this flag will 
give unexpected results
+/// when comparing two arrays with different dictionaries.
+///
+/// Other flags need to be handled when calling the comparator.

Review comment:
       Here is another place where using the `is_ordered` flag directly on the 
`DictionaryArray` might make the interface (much) cleaner. 

##########
File path: arrow/src/array/ord.rs
##########
@@ -101,6 +102,24 @@ where
     })
 }
 
+fn compare_dict_key<T>(left: &dyn Array, right: &dyn Array) -> DynComparator
+where
+    T: ArrowDictionaryKeyType,
+    T::Native: Ord,
+{
+    let left = left.as_any().downcast_ref::<DictionaryArray<T>>().unwrap();
+    let right = right.as_any().downcast_ref::<DictionaryArray<T>>().unwrap();
+
+    let left_keys: PrimitiveArray<T> = 
PrimitiveArray::from(left.keys().data().clone());
+    let right_keys: PrimitiveArray<T> = 
PrimitiveArray::from(right.keys().data().clone());
+
+    Box::new(move |i: usize, j: usize| {
+        let key_left = left_keys.value(i);
+        let key_right = right_keys.value(j);
+        <T::Native as Ord>::cmp(&key_left, &key_right)
+    })

Review comment:
       I don't understand why we need to replicate the logic from 
`build_compare` here.
   
   I think it is possible to call `build_compare` directly:
   
   ```suggestion
       build_compare(left.keys(), right.keys())
   ```
   
   
   I had to make a few other changes, but then it worked ๐Ÿ‘ 
   ```diff
   diff --git a/arrow/src/array/ord.rs b/arrow/src/array/ord.rs
   index 8497a4167d..5573bd67e6 100644
   --- a/arrow/src/array/ord.rs
   +++ b/arrow/src/array/ord.rs
   @@ -102,22 +102,14 @@ where
        })
    }
    
   -fn compare_dict_key<T>(left: &dyn Array, right: &dyn Array) -> DynComparator
   +fn compare_dict_key<T>(left: &dyn Array, right: &dyn Array) -> 
Result<DynComparator>
    where
        T: ArrowDictionaryKeyType,
        T::Native: Ord,
    {
        let left = left.as_any().downcast_ref::<DictionaryArray<T>>().unwrap();
        let right = 
right.as_any().downcast_ref::<DictionaryArray<T>>().unwrap();
   -
   -    let left_keys: PrimitiveArray<T> = 
PrimitiveArray::from(left.keys().data().clone());
   -    let right_keys: PrimitiveArray<T> = 
PrimitiveArray::from(right.keys().data().clone());
   -
   -    Box::new(move |i: usize, j: usize| {
   -        let key_left = left_keys.value(i);
   -        let key_right = right_keys.value(j);
   -        <T::Native as Ord>::cmp(&key_left, &key_right)
   -    })
   +    build_compare(left.keys(), right.keys())
    }
    
    /// returns a comparison function that compares two values at two different 
positions
   @@ -242,14 +234,14 @@ pub fn build_compare_with_options(
                    ));
                } else if options.assume_sorted_dictionaries {
                    match (key_type_lhs.as_ref(), key_type_rhs.as_ref()) {
   -                    (UInt8, UInt8) => compare_dict_key::<UInt8Type>(left, 
right),
   -                    (UInt16, UInt16) => 
compare_dict_key::<UInt16Type>(left, right),
   -                    (UInt32, UInt32) => 
compare_dict_key::<UInt32Type>(left, right),
   -                    (UInt64, UInt64) => 
compare_dict_key::<UInt64Type>(left, right),
   -                    (Int8, Int8) => compare_dict_key::<Int8Type>(left, 
right),
   -                    (Int16, Int16) => compare_dict_key::<Int16Type>(left, 
right),
   -                    (Int32, Int32) => compare_dict_key::<Int32Type>(left, 
right),
   -                    (Int64, Int64) => compare_dict_key::<Int64Type>(left, 
right),
   +                    (UInt8, UInt8) => compare_dict_key::<UInt8Type>(left, 
right)?,
   +                    (UInt16, UInt16) => 
compare_dict_key::<UInt16Type>(left, right)?,
   +                    (UInt32, UInt32) => 
compare_dict_key::<UInt32Type>(left, right)?,
   +                    (UInt64, UInt64) => 
compare_dict_key::<UInt64Type>(left, right)?,
   +                    (Int8, Int8) => compare_dict_key::<Int8Type>(left, 
right)?,
   +                    (Int16, Int16) => compare_dict_key::<Int16Type>(left, 
right)?,
   +                    (Int32, Int32) => compare_dict_key::<Int32Type>(left, 
right)?,
   +                    (Int64, Int64) => compare_dict_key::<Int64Type>(left, 
right)?,
                        (lhs, _) => {
                            return Err(ArrowError::InvalidArgumentError(format!(
                                "Dictionaries do not support keys of type {:?}",
   ```

##########
File path: arrow/benches/partition_kernels.rs
##########
@@ -39,6 +43,38 @@ where
     Arc::new(array)
 }
 
+fn create_sorted_dictionary_array<T: ArrowDictionaryKeyType>(
+    size: usize,
+    num_distinct_keys: usize,
+) -> ArrayRef
+where
+    Standard: Distribution<T::Native>,
+{
+    let mut rng = seedable_rng();
+
+    let key_array: ArrayRef =

Review comment:
       It would be cool if we could add an option to `StringDictionaryBuilder` 
to ensure the resulting dictionary was sorted

##########
File path: arrow/src/compute/kernels/sort.rs
##########
@@ -2357,12 +2495,116 @@ mod tests {
             Some(SortOptions {
                 descending: false,
                 nulls_first: false,
+                assume_sorted_dictionaries: false,
             }),
             Some(2),
             vec![Some("def"), None],
         );
     }
 
+    #[test]
+    fn test_sort_dicts_by_key() {
+        test_sort_string_dict_arrays::<Int8Type>(
+            vec![None, Some("B"), Some("A"), None, Some("C"), Some("A")],
+            Some(SortOptions {
+                assume_sorted_dictionaries: true,
+                ..Default::default()
+            }),
+            None,
+            vec![None, None, Some("B"), Some("A"), Some("A"), Some("C")],

Review comment:
       I must be missing something here -- shouldn't this output be sorted? 
Namely `None, None, Some("A"), Some("A"), Some("B"), Some("C")`?
   
   I also don't understand how this code ensures that the `DictionaryArrays` 
actually have sorted dictionaries as I don't see `test_sort_string_dict_arrays` 
doing anything special




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to