alamb commented on code in PR #2322:
URL: https://github.com/apache/arrow-rs/pull/2322#discussion_r938720792
##########
parquet/src/arrow/arrow_writer/byte_array.rs:
##########
@@ -33,11 +32,38 @@ use crate::schema::types::ColumnDescPtr;
use crate::util::bit_util::num_required_bits;
use crate::util::interner::{Interner, Storage};
use arrow::array::{
- Array, ArrayAccessor, ArrayRef, BinaryArray, LargeBinaryArray,
LargeStringArray,
- StringArray,
+ Array, ArrayAccessor, ArrayRef, BinaryArray, DictionaryArray,
LargeBinaryArray,
+ LargeStringArray, StringArray,
};
use arrow::datatypes::DataType;
+macro_rules! downcast_dict_impl {
+ ($array:ident, $key:ident, $val:ident, $op:expr $(, $arg:expr)*) => {{
+ $op($array
+ .as_any()
+ .downcast_ref::<DictionaryArray<arrow::datatypes::$key>>()
Review Comment:
When you say "could be made faster" the issue is that this is effectively
creating something that will iterate over strings, which means that to encode
the column, the arrow array dictionary index is used to find a string, which is
then used to find the parquet array index which is then written.
It could potentially be faster if we skipped the string step in the middle
and simply computed an arrow dictionary index --> parquet dictionary index
mapping up front and applied that mapping during writing
(I think you said this in this PR's description, but I am restating it to
confirm I understand what is happening)
##########
arrow/src/array/array_dictionary.rs:
##########
@@ -421,14 +421,25 @@ impl<T: ArrowPrimitiveType> fmt::Debug for
DictionaryArray<T> {
/// assert_eq!(maybe_val.unwrap(), orig)
/// }
/// ```
-#[derive(Copy, Clone)]
pub struct TypedDictionaryArray<'a, K: ArrowPrimitiveType, V> {
/// The dictionary array
dictionary: &'a DictionaryArray<K>,
/// The values of the dictionary
values: &'a V,
}
+// Manually implement `Clone` to avoid `V: Clone` type constraint
Review Comment:
it is strange that having a reference to `& V` would require `V: Clone` in
order to `#[derive(Clone)]` 🤷
##########
arrow/src/util/data_gen.rs:
##########
@@ -143,6 +143,17 @@ pub fn create_random_array(
})
.collect::<Result<Vec<(&str, ArrayRef)>>>()?,
)?),
+ d @ Dictionary(_, value_type)
+ if crate::compute::can_cast_types(value_type, d) =>
Review Comment:
using `cast` is a neat trick here 👍
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]