[GitHub] [arrow-datafusion] alamb commented on pull request #1860: Increase default partition column type from Dict(UInt8) to Dict(UInt16)

GitBox Wed, 02 Mar 2022 08:33:24 -0800


alamb commented on pull request #1860:
URL: https://github.com/apache/arrow-datafusion/pull/1860#issuecomment-1057125192



   > The idea behind using UInt8 is that the values of a given partition column 
within a file will be all identical. If I have to materialize a large array 
with only zeros, I would rather not encode each 0 on 64 bits 😄. 
   
   I think this PR proposes to use 16 bits rather than 64 to allow more than 
256 distinct partition values. One example use case might be partitioning by 
postal code, since there are more than 256 distinct postal codes in the United 
States.
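
   To make the 256-value limit concrete, here is a minimal sketch in plain 
Rust. It does not use the arrow or DataFusion APIs; the function names 
(`max_distinct_values`, `key_bytes`, `key_u8`, `key_u16`) are hypothetical and 
only illustrate how many distinct dictionary entries an unsigned key of a 
given width can address, and what the keys alone cost per row.

```rust
/// Number of distinct dictionary entries an unsigned key of `bits` width
/// can address (hypothetical helper, not part of arrow or DataFusion).
fn max_distinct_values(bits: u32) -> u128 {
    1u128 << bits
}

/// Bytes needed to store only the keys for `rows` rows at the given width.
fn key_bytes(rows: usize, bits: u32) -> usize {
    rows * (bits as usize / 8)
}

/// Key for the `index`-th distinct value (0-based) if it fits in a u8.
fn key_u8(index: usize) -> Option<u8> {
    if index <= u8::MAX as usize {
        Some(index as u8)
    } else {
        None // the 257th distinct value has no representable u8 key
    }
}

/// Key for the `index`-th distinct value (0-based) if it fits in a u16.
fn key_u16(index: usize) -> Option<u16> {
    if index <= u16::MAX as usize {
        Some(index as u16)
    } else {
        None
    }
}
```

   Under these assumptions, `key_u8(256)` returns `None` while `key_u16(256)` 
succeeds, and the keys for 1000 rows take 2000 bytes at 16 bits versus 8000 
bytes at 64 bits.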
   
   > To actually have a record batch with multiple partition values, you would 
need to go through something like the concat kernel first. Wouldn't it make 
sense to rely on that kernel to re-cast the index type appropriately? I think 
that it would be a safer approach in general to avoid overflowing when merging 
dictionaries.
   
   Having some way to dynamically pick the size of the dictionary keys 
certainly seems like a nice feature -- I am not sure how large of a change it 
would be though.
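
   As a rough illustration of what picking the key size dynamically could 
look like, here is a minimal sketch in plain Rust. It is not how the concat 
kernel or DataFusion works today; `KeyWidth`, `smallest_key_width` and 
`merged_key_width` are hypothetical names, and the merge helper assumes the 
worst case where the two dictionaries share no values.

```rust
/// Candidate unsigned key widths for a dictionary (hypothetical enum).
#[derive(Debug, PartialEq, Clone, Copy)]
enum KeyWidth {
    U8,
    U16,
    U32,
    U64,
}

/// Smallest unsigned key width able to index `distinct_values` entries.
fn smallest_key_width(distinct_values: u64) -> KeyWidth {
    // Keys are 0-based, so the largest key needed is distinct_values - 1.
    let max_key = distinct_values.saturating_sub(1);
    if max_key <= u8::MAX as u64 {
        KeyWidth::U8
    } else if max_key <= u16::MAX as u64 {
        KeyWidth::U16
    } else if max_key <= u32::MAX as u64 {
        KeyWidth::U32
    } else {
        KeyWidth::U64
    }
}

/// Key width needed after merging two dictionaries, assuming the worst
/// case where they have no values in common.
fn merged_key_width(left_distinct: u64, right_distinct: u64) -> KeyWidth {
    smallest_key_width(left_distinct.saturating_add(right_distinct))
}
```

   With this sketch, two dictionaries of 200 and 100 distinct values each fit 
in `U8` on their own, but `merged_key_width(200, 100)` picks `U16`, which is 
the overflow case the quoted comment is concerned about.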


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

