alamb commented on pull request #9233: URL: https://github.com/apache/arrow/pull/9233#issuecomment-762174671
> If we are able to describe in the partitioning information that the partition is hashed by some column that is a dictionary, doesn't that allow us to perform very fast hashing (based on the dictionary indexes)?

@jorgecarleitao yes, I think that would be a great optimization. We could possibly even skip hashing entirely and build the aggregate table directly on the dictionary indexes. I suspect this would work well in the common case, but we would have to handle the case where the dictionary itself is not the same across all record batches (and thus an index in one record batch may not correspond to the same value in another).
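To make the idea concrete, here is a rough sketch in plain Rust of aggregating on dictionary indexes. It does not use the Arrow or DataFusion APIs: `DictBatch`, `sum_by_key`, and `sum_across_batches` are hypothetical names made up for illustration, and the sketch assumes there are no null keys and that every key is a valid index into its batch's dictionary. Within a batch the sums are accumulated in a vector indexed by the dictionary key, so no row value is hashed. Across batches, each distinct dictionary value is hashed once per batch, which is one possible way to cope with dictionaries that differ between batches.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one record batch holding a dictionary-encoded
/// group column and an integer column to sum. Not an Arrow type.
struct DictBatch {
    dictionary: Vec<String>, // distinct group values for this batch
    keys: Vec<usize>,        // per-row index into `dictionary` (assumed non-null, in range)
    values: Vec<i64>,        // per-row value to aggregate
}

/// Sum `values` per dictionary index within a single batch.
/// The accumulator is indexed directly by the key, so no hashing is needed.
/// An entry stays `None` if no row in the batch referenced that index.
fn sum_by_key(batch: &DictBatch) -> Vec<Option<i64>> {
    let mut sums: Vec<Option<i64>> = vec![None; batch.dictionary.len()];
    for (&key, &value) in batch.keys.iter().zip(&batch.values) {
        *sums[key].get_or_insert(0) += value;
    }
    sums
}

/// Combine per-batch results when dictionaries may differ between batches.
/// Index `i` in one batch may mean a different value in another, so the
/// merge is keyed on the dictionary value itself. This costs one hash per
/// distinct value per batch rather than one hash per row.
fn sum_across_batches(batches: &[DictBatch]) -> HashMap<String, i64> {
    let mut totals: HashMap<String, i64> = HashMap::new();
    for batch in batches {
        let sums = sum_by_key(batch);
        for (value, sum) in batch.dictionary.iter().zip(sums) {
            if let Some(sum) = sum {
                *totals.entry(value.clone()).or_insert(0) += sum;
            }
        }
    }
    totals
}
```

If all batches were known to share the same dictionary, the merge step could presumably add the per-index vectors directly and skip hashing altogether; detecting that case cheaply is the part that would need design work.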
