pitrou commented on pull request #9683: URL: https://github.com/apache/arrow/pull/9683#issuecomment-800442398
There are indeed two possible approaches:

* unify all chunks first, and then run the unique kernel over the transposed indices (as proposed by @rok)
* run the unique kernel over the original chunks, and then hash-aggregate the unique results of the different chunks (in effect `SELECT sum(counts) GROUP BY values`)

The second approach could be faster in the (unusual?) cases where only a small subset of dictionary values actually appear in the data. If most dictionary values are used, though, both approaches should have similar performance.

Since we don't have a generic hash-aggregate yet, the first approach sounds good enough. (Also note that `unique` is in itself a special case of hash-aggregation.)

cc @bkietz for opinions
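To make the second approach concrete, here is a minimal plain-Python sketch (not the Arrow implementation; the chunk data is hypothetical and decoded values stand in for dictionary indices). It counts values per chunk, then hash-aggregates the per-chunk results, i.e. `SELECT sum(counts) GROUP BY values`:

```python
from collections import Counter

# Two chunks of a dictionary-encoded column, shown here as plain
# Python lists of decoded values (hypothetical data for illustration).
chunks = [
    ["a", "b", "a", "c"],
    ["b", "b", "d"],
]

# Step 1: run the per-chunk "unique + count" (value_counts) kernel.
per_chunk_counts = [Counter(chunk) for chunk in chunks]

# Step 2: hash-aggregate the per-chunk results --
# in effect SELECT sum(counts) GROUP BY values.
totals = Counter()
for counts in per_chunk_counts:
    totals.update(counts)  # sums counts for matching values

# totals is now {"b": 3, "a": 2, "c": 1, "d": 1}
```

Note that step 2 only touches the distinct values of each chunk, which is why this wins when few dictionary values actually appear in the data.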
