pitrou commented on pull request #9683:
URL: https://github.com/apache/arrow/pull/9683#issuecomment-800442398


   There are indeed two possible approaches:
   * unify all chunks first, and then run the unique kernel over the transposed 
indices (as proposed by @rok)
   * run the unique kernel over the original chunks, and then hash-aggregate 
the per-chunk unique results (in effect `SELECT values, sum(counts) GROUP 
BY values`)
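   As a rough sketch of the first approach (with plain Python lists standing in 
for Arrow dictionary arrays, so the function and data layout here are 
illustrative, not Arrow's actual API): unify the per-chunk dictionaries into 
one, transpose each chunk's indices into the unified dictionary, then run a 
single unique/value-counts pass over the transposed indices.

   ```python
   from collections import Counter

   def unify_and_count(chunks):
       # chunks: list of (dictionary, indices) pairs -- plain-Python stand-ins
       # for the chunks of a dictionary-encoded chunked array.
       unified = []    # unified dictionary (list of distinct values)
       position = {}   # value -> index in the unified dictionary
       transposed = [] # all chunk indices, remapped into the unified dictionary
       for dictionary, indices in chunks:
           # Build the transposition map for this chunk's dictionary.
           transpose = []
           for value in dictionary:
               if value not in position:
                   position[value] = len(unified)
                   unified.append(value)
               transpose.append(position[value])
           # Remap this chunk's indices through the transposition map.
           transposed.extend(transpose[i] for i in indices)
       # Single unique/value-counts pass over the transposed indices.
       counts = Counter(transposed)
       return {unified[i]: n for i, n in counts.items()}

   # Two chunks with overlapping but distinct dictionaries:
   chunks = [(["a", "b"], [0, 1, 0]), (["b", "c"], [0, 1])]
   # unify_and_count(chunks) -> {"a": 2, "b": 2, "c": 1}
   ```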
   
   The second approach could be faster in the (unusual?) cases where only a 
small subset of dictionary values actually appear in the data. If most 
dictionary values are used, though, both approaches should perform similarly.
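   The second approach amounts to the following sketch (again with plain Python 
lists standing in for Arrow chunks; this illustrates the hash-aggregation 
step, not Arrow's implementation): count values per chunk, then merge the 
per-chunk results by value, i.e. `sum(counts) GROUP BY values`.

   ```python
   from collections import Counter

   def value_counts_chunked(chunks):
       # Hash-aggregate the per-chunk value counts: each Counter(chunk) is the
       # per-chunk "unique with counts" result, and update() sums counts by value.
       total = Counter()
       for chunk in chunks:
           total.update(Counter(chunk))
       return dict(total)

   chunks = [["a", "b", "a"], ["b", "c"]]
   # value_counts_chunked(chunks) -> {"a": 2, "b": 2, "c": 1}
   ```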
   
   Since we don't have a generic hash-aggregate yet, the first approach sounds 
good enough.
   (also note that `unique` is in itself a special case of hash-aggregation)
   
   cc @bkietz for opinions

