alamb commented on code in PR #2322:
URL: https://github.com/apache/arrow-rs/pull/2322#discussion_r938720792


##########
parquet/src/arrow/arrow_writer/byte_array.rs:
##########
@@ -33,11 +32,38 @@ use crate::schema::types::ColumnDescPtr;
 use crate::util::bit_util::num_required_bits;
 use crate::util::interner::{Interner, Storage};
 use arrow::array::{
-    Array, ArrayAccessor, ArrayRef, BinaryArray, LargeBinaryArray, 
LargeStringArray,
-    StringArray,
+    Array, ArrayAccessor, ArrayRef, BinaryArray, DictionaryArray, 
LargeBinaryArray,
+    LargeStringArray, StringArray,
 };
 use arrow::datatypes::DataType;
 
+macro_rules! downcast_dict_impl {
+    ($array:ident, $key:ident, $val:ident, $op:expr $(, $arg:expr)*) => {{
+        $op($array
+            .as_any()
+            .downcast_ref::<DictionaryArray<arrow::datatypes::$key>>()

Review Comment:
   When you say "could be made faster" the issue is that this is effectively 
creating something that will iterate over strings, which means that to encode 
the column, the arrow array dictionary index is used to find a string, which is 
then used to find the parquet array index which is then written.
   
    It could potentially be faster if we skipped the string step in the middle 
and simply computed an arrow dictionary index --> parquet dictionary index 
mapping up front and applied that mapping during writing
   
   (I think you said this in this PR's description, but I am restating it to 
confirm I understand what is happening)
   



##########
arrow/src/array/array_dictionary.rs:
##########
@@ -421,14 +421,25 @@ impl<T: ArrowPrimitiveType> fmt::Debug for 
DictionaryArray<T> {
 ///     assert_eq!(maybe_val.unwrap(), orig)
 /// }
 /// ```
-#[derive(Copy, Clone)]
 pub struct TypedDictionaryArray<'a, K: ArrowPrimitiveType, V> {
     /// The dictionary array
     dictionary: &'a DictionaryArray<K>,
     /// The values of the dictionary
     values: &'a V,
 }
 
+// Manually implement `Clone` to avoid `V: Clone` type constraint

Review Comment:
   it is strange that having a reference to `& V` would require `V: Clone` in 
order to `#[derive(Clone)]` 🤷 



##########
arrow/src/util/data_gen.rs:
##########
@@ -143,6 +143,17 @@ pub fn create_random_array(
                 })
                 .collect::<Result<Vec<(&str, ArrayRef)>>>()?,
         )?),
+        d @ Dictionary(_, value_type)
+            if crate::compute::can_cast_types(value_type, d) =>

Review Comment:
   using `cast` is a neat trick here 👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to