Dandandan commented on issue #5547: URL: https://github.com/apache/arrow-datafusion/issues/5547#issuecomment-1464090614
> @Dandandan offered an interesting option trying the Primitive Dictionary Builder in [#5472 (comment)](https://github.com/apache/arrow-datafusion/issues/5472#issuecomment-1454123133). It would be interesting to test it out. As I understand it, the idea is not to use `PrimitiveDictionaryBuilder` directly but to build a similar, more lightweight structure? Could you elaborate a bit on how you see it?
>
> Another question for Ballista: for distributed COUNT DISTINCT to get the final aggregated result, they should use the same structure, right?

Yes, we would basically not have to use the `keys_builder` and could keep the rest similar to `GenericByteDictionaryBuilder`: https://docs.rs/arrow-array/34.0.0/src/arrow_array/builder/generic_bytes_dictionary_builder.rs.html#62

For primitive values we don't even need a similar structure; we can use something more like the `HashSet<ArrowPrimitiveType::Native>` you suggested, which will be close to optimal.

So the structure could look something like:

```rust
use std::collections::HashSet;

use arrow_array::builder::GenericByteBuilder;
use arrow_array::types::{ArrowPrimitiveType, ByteArrayType};
// hashbrown's map, as in the linked builder (needed for the raw entry API)
use hashbrown::HashMap;

// No key type parameter `K`: without a `keys_builder` nothing would use it.
#[derive(Debug)]
pub struct BinaryDistinctValues<T>
where
    T: ByteArrayType,
{
    state: ahash::RandomState,
    /// Used to provide a lookup from a string value to its index in
    /// `values_builder`
    ///
    /// Note: usize's hash implementation is not used, instead the raw entry
    /// API is used to store keys w.r.t the hash of the strings themselves
    dedup: HashMap<usize, (), ()>,
    values_builder: GenericByteBuilder<T>,
}

enum DistinctAggregationState<P, T>
where
    P: ArrowPrimitiveType,
    T: ByteArrayType,
{
    /// Primitive values
    Primitive(HashSet<P::Native>),
    /// Binary / UTF-8 values
    Binary(BinaryDistinctValues<T>),
}
```

(Note that this doesn't support the `Struct` type.)
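To make the deduplication idea in the comment concrete, here is a minimal, std-only sketch. It is not the proposed implementation: `DistinctBytes` and `DistinctState` are hypothetical names, a plain `Vec<u8>` plus offsets stands in for `GenericByteBuilder`, `i64` stands in for `ArrowPrimitiveType::Native`, and a `HashMap<u64, Vec<usize>>` keyed by the value's hash stands in for the hashbrown raw entry trick (the stable standard library has no raw entry API).

```rust
use std::collections::hash_map::RandomState;
use std::collections::{HashMap, HashSet};
use std::hash::{BuildHasher, Hasher};

/// Hypothetical stand-in for `BinaryDistinctValues`: every distinct byte
/// string is stored exactly once in one contiguous buffer.
#[derive(Debug)]
pub struct DistinctBytes {
    state: RandomState,
    /// Hash of a value -> indices of the stored values with that hash.
    /// (Assumption: this replaces the raw-entry `HashMap<usize, (), ()>`.)
    dedup: HashMap<u64, Vec<usize>>,
    /// All distinct values, concatenated.
    data: Vec<u8>,
    /// `offsets[i]..offsets[i + 1]` is the byte range of the i-th value.
    offsets: Vec<usize>,
}

impl DistinctBytes {
    pub fn new() -> Self {
        Self {
            state: RandomState::new(),
            dedup: HashMap::new(),
            data: Vec::new(),
            offsets: vec![0],
        }
    }

    /// Bytes of the `idx`-th distinct value.
    fn get(&self, idx: usize) -> &[u8] {
        &self.data[self.offsets[idx]..self.offsets[idx + 1]]
    }

    /// Inserts `value`; returns `true` if it had not been seen before.
    pub fn insert(&mut self, value: &[u8]) -> bool {
        let mut hasher = self.state.build_hasher();
        hasher.write(value);
        let hash = hasher.finish();

        // Compare against stored values sharing this hash to rule out collisions.
        if let Some(bucket) = self.dedup.get(&hash) {
            if bucket.iter().any(|&i| self.get(i) == value) {
                return false;
            }
        }

        // New value: append its bytes and remember its index under the hash.
        let idx = self.offsets.len() - 1;
        self.data.extend_from_slice(value);
        self.offsets.push(self.data.len());
        self.dedup.entry(hash).or_default().push(idx);
        true
    }

    /// Number of distinct values stored so far.
    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }
}

/// Hypothetical stand-in for `DistinctAggregationState`.
pub enum DistinctState {
    /// Primitive values (`i64` is an assumption for illustration).
    Primitive(HashSet<i64>),
    /// Binary / UTF-8 values.
    Binary(DistinctBytes),
}

impl DistinctState {
    /// The COUNT DISTINCT result for the values seen so far.
    pub fn count(&self) -> usize {
        match self {
            DistinctState::Primitive(set) => set.len(),
            DistinctState::Binary(values) => values.len(),
        }
    }
}
```

Inserting `b"a"`, `b"bc"`, and `b"a"` again leaves two stored values, so `count()` returns 2. The per-hash `Vec<usize>` costs an extra allocation per bucket, which the raw entry approach in the linked builder avoids; the sketch trades that efficiency for portability.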
