jhorstmann commented on code in PR #1048:
URL: https://github.com/apache/arrow-rs/pull/1048#discussion_r889508402


##########
arrow/src/array/array_dictionary.rs:
##########
@@ -163,6 +165,92 @@ impl<'a, K: ArrowPrimitiveType> DictionaryArray<K> {
         self.is_ordered
     }
 
+    /// Returns a DictionaryArray referencing the same data
+    /// with the [DictionaryArray::is_ordered] flag set to `true`.
+    /// Note that this does not actually reorder the values in the dictionary.
+    pub fn as_ordered(&self) -> Self {
+        Self {
+            data: self.data.clone(),
+            values: self.values.clone(),
+            keys: PrimitiveArray::<K>::from(self.keys.data().clone()),
+            is_ordered: true,
+        }
+    }
+
+    pub fn make_ordered(&self) -> Result<Self> {
+        let values = self.values();
+        if self.is_ordered || values.is_empty() {
+            Ok(self.as_ordered())
+        } else {
+            // validate up front that we can do conversions from/to usize for 
the whole range of keys
+            // this allows using faster unchecked conversions below
+            K::Native::from_usize(values.len())
+                .ok_or(ArrowError::DictionaryKeyOverflowError)?;
+            // sort indices are u32 so we cannot sort larger dictionaries
+            u32::try_from(values.len())
+                .map_err(|_| ArrowError::DictionaryKeyOverflowError)?;
+
+            // sort the dictionary values
+            let sort_indices = sort_to_indices(values, None, None)?;
+            let sorted_dictionary = take(
+                values.as_ref(),
+                &sort_indices,
+                Some(TakeOptions {
+                    check_bounds: false,
+                }),
+            )?;
+
+            // build a lookup table from old to new key
+            let mut lookup = vec![0; sort_indices.len()];
+            sort_indices
+                .values()
+                .iter()
+                .enumerate()
+                .for_each(|(i, idx)| {
+                    lookup[*idx as usize] = i;
+                });
+
+            let mapped_keys_iter = self.keys_iter().map(|opt_key| {
+                if let Some(key) = opt_key {
+                    // Safety:
+                    // lookup has the same length as the dictionary values
+                    // so if the keys were valid for values they will be valid 
indices into lookup
+                    unsafe {
+                        debug_assert!(key < lookup.len());
+                        let new_key = *lookup.get_unchecked(key);
+                        debug_assert!(new_key < values.len());
+                        K::Native::from_usize(new_key).unwrap_unchecked()
+                    }
+                } else {
+                    K::default_value()
+                }
+            });
+
+            // Safety:
+            // PrimitiveIter has a trusted len
+            let new_key_buffer =
+                unsafe { Buffer::from_trusted_len_iter(mapped_keys_iter) };
+
+            // Safety:
+            // after remapping the keys will be in the same range as before
+            let new_data = unsafe {
+                ArrayData::new_unchecked(
+                    self.data_type().clone(),
+                    self.len(),
+                    Some(self.data.null_count()),

Review Comment:
   I don't think I benchmarked this separately, but it should be quite a bit 
faster. In the most common case of offset being zero this would be zero-copy, 
while creating a PrimitiveArray from an iterator has some overhead per element, 
checking whether the buffers need to grow and setting individual bits in the 
validity bitmap. If there is no validity, the iterator api would also construct 
one first.
   
   I might try looking into the performance of the iterator based apis in a 
separate issue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to