Dandandan commented on issue #5547:
URL: 
https://github.com/apache/arrow-datafusion/issues/5547#issuecomment-1464090614

   > @Dandandan offered an interesting option: trying the Primitive Dictionary 
Builder in [#5472 (comment)](https://github.com/apache/arrow-datafusion/issues/5472#issuecomment-1454123133).
 It would be interesting to test it out. As I understand it, the idea is not to use 
`PrimitiveDictionaryBuilder` directly but to build a similar, more lightweight 
structure? Could you elaborate a bit on how you see it?
   > 
   > Another question, for Ballista: to get the final aggregated result of a 
distributed COUNT DISTINCT, they should use the same structure, right?
   
   Yes, we would basically not have to use the `keys_builder` and keep the rest 
similar to `GenericByteDictionaryBuilder`: 
https://docs.rs/arrow-array/34.0.0/src/arrow_array/builder/generic_bytes_dictionary_builder.rs.html#62
   
   For primitive values we don't even need a similar structure; we can use 
something more like the `HashSet<ArrowPrimitiveType::Native>` you suggested, 
which will be close to optimal.
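   
   As a rough, std-only sketch of that primitive case (the `PrimitiveDistinct` 
name and its methods are hypothetical, not existing DataFusion or Ballista 
APIs, and `N` stands in for `ArrowPrimitiveType::Native`), the state could be 
a plain `HashSet` that is merged across partitions:
   
   ```rust
   use std::collections::HashSet;
   use std::hash::Hash;

   /// Hypothetical COUNT(DISTINCT x) state for one primitive column.
   struct PrimitiveDistinct<N: Eq + Hash> {
       values: HashSet<N>,
   }

   impl<N: Eq + Hash + Copy> PrimitiveDistinct<N> {
       fn new() -> Self {
           Self { values: HashSet::new() }
       }

       /// Add one batch; `None` models NULL, which COUNT(DISTINCT) ignores.
       fn update_batch(&mut self, batch: &[Option<N>]) {
           self.values.extend(batch.iter().flatten().copied());
       }

       /// Merge the partial state of another partition or executor.
       fn merge(&mut self, other: Self) {
           self.values.extend(other.values);
       }

       /// Number of distinct non-NULL values seen so far.
       fn count(&self) -> usize {
           self.values.len()
       }
   }
   ```
   
   Because merging is just a set union, the same state would also work for the 
final distributed aggregation. Note that `f32` / `f64` do not implement 
`Eq + Hash`, so float columns would need a wrapper or a bit-level key.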
   
   So the structure could look something like:
   
   ```rust
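   use std::collections::HashSet;

   use arrow_array::builder::GenericByteBuilder;
   use arrow_array::types::{ArrowPrimitiveType, ByteArrayType};
   use hashbrown::HashMap;

   /// Distinct binary / UTF-8 values: `GenericByteDictionaryBuilder` without
   /// the `keys_builder`, so no dictionary key type parameter is needed
   #[derive(Debug)]
   pub struct BinaryDistinctValues<T>
   where
       T: ByteArrayType,
   {
       state: ahash::RandomState,
       /// Used to provide a lookup from a value to its index in `values_builder`
       ///
       /// Note: usize's hash implementation is not used, instead the raw entry
       /// API is used to store keys w.r.t the hash of the strings themselves
       dedup: HashMap<usize, (), ()>,
       values_builder: GenericByteBuilder<T>,
   }

   enum DistinctAggregationState<P, T>
   where
       P: ArrowPrimitiveType,
       T: ByteArrayType,
   {
       /// Primitive values
       Primitive(HashSet<P::Native>),
       /// Binary / UTF-8 values
       Binary(BinaryDistinctValues<T>),
   }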
   ```
   
    (note this doesn't support the `Struct` type)
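   
   To illustrate the dedup idea behind `BinaryDistinctValues`, here is a 
std-only sketch. It is not the arrow implementation: it swaps the hashbrown 
raw entry API for a `HashMap<u64, Vec<usize>>` keyed by the value's hash, and 
`DistinctBytes` with all its fields is a hypothetical stand-in. Each distinct 
value is stored once in a contiguous buffer, and the map only holds indices:
   
   ```rust
   use std::collections::hash_map::RandomState;
   use std::collections::HashMap;
   use std::hash::BuildHasher;

   /// Hypothetical std-only stand-in for `BinaryDistinctValues`.
   struct DistinctBytes {
       state: RandomState,
       /// Hash of a value -> indices of the distinct values with that hash.
       dedup: HashMap<u64, Vec<usize>>,
       /// All distinct values, concatenated.
       data: Vec<u8>,
       /// `offsets[i]..offsets[i + 1]` is the byte range of the i-th value.
       offsets: Vec<usize>,
   }

   impl DistinctBytes {
       fn new() -> Self {
           Self {
               state: RandomState::new(),
               dedup: HashMap::new(),
               data: Vec::new(),
               offsets: vec![0],
           }
       }

       /// The i-th distinct value, in insertion order.
       fn value(&self, idx: usize) -> &[u8] {
           &self.data[self.offsets[idx]..self.offsets[idx + 1]]
       }

       /// Number of distinct values stored.
       fn len(&self) -> usize {
           self.offsets.len() - 1
       }

       /// Insert `value`; returns `true` if it was not seen before.
       fn insert(&mut self, value: &[u8]) -> bool {
           let hash = self.state.hash_one(value);
           // Compare against stored bytes to resolve hash collisions.
           let seen = self
               .dedup
               .get(&hash)
               .map_or(false, |bucket| bucket.iter().any(|&i| self.value(i) == value));
           if seen {
               return false;
           }
           let idx = self.len();
           self.data.extend_from_slice(value);
           self.offsets.push(self.data.len());
           self.dedup.entry(hash).or_default().push(idx);
           true
       }
   }
   ```
   
   The raw entry approach in the struct above avoids the per-hash `Vec` by 
storing the index itself as the map key, hashed by the bytes it points to.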