rdettai edited a comment on pull request #1860:
URL: 
https://github.com/apache/arrow-datafusion/pull/1860#issuecomment-1058110737


   > I think this PR proposes to use 16 bits rather than 64 to allow more than 
256 distinct partition values. One example usecase might be when there are more 
than 256 distinct postal codes in the United States)
   
   I am not disputing that partition keys can have billions of 
different values 🙂. But I don't think this is the best place to bump the 
dictionary index size: at the file level, a partition column cannot hold 
more than one distinct value within a single record batch. It would be nicer 
to upcast this type downstream, when the record batches are manipulated in a 
way that breaks this uniqueness (for example after a `concat` op). Also, it 
would be even nicer if we had 
https://github.com/apache/arrow-datafusion/issues/1248 instead 😄 
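
   To make the "upcast downstream" idea concrete, here is a minimal sketch in 
plain Rust. It does not use the Arrow or DataFusion APIs: `PartitionColumn`, 
`min_key_bits` and this `concat` are hypothetical names invented for 
illustration, and the keys are stored as `u64` only to keep the model simple. 
It assumes the key width would be chosen from the number of distinct values 
once batches are merged.

```rust
/// Smallest unsigned key width (in bits) that can index `distinct` entries.
/// Hypothetical helper, not part of Arrow.
fn min_key_bits(distinct: u64) -> u32 {
    if distinct <= 1u64 << 8 {
        8
    } else if distinct <= 1u64 << 16 {
        16
    } else if distinct <= 1u64 << 32 {
        32
    } else {
        64
    }
}

/// Toy model of a dictionary-encoded partition column (an assumption,
/// not the real Arrow `DictionaryArray`).
#[derive(Debug, Clone, PartialEq)]
struct PartitionColumn {
    values: Vec<String>, // the dictionary
    keys: Vec<u64>,      // one index into `values` per row
}

impl PartitionColumn {
    /// A batch read from a single file: every row shares one partition value,
    /// so the dictionary has exactly one entry.
    fn for_file(value: &str, num_rows: usize) -> Self {
        Self {
            values: vec![value.to_string()],
            keys: vec![0; num_rows],
        }
    }

    /// Key width this column actually needs.
    fn key_bits(&self) -> u32 {
        min_key_bits(self.values.len() as u64)
    }
}

/// Merge batches, unifying their dictionaries and remapping the keys.
/// Only here can the number of distinct values exceed one.
fn concat(batches: &[PartitionColumn]) -> PartitionColumn {
    let mut values: Vec<String> = Vec::new();
    let mut keys: Vec<u64> = Vec::new();
    for batch in batches {
        // Map each of this batch's dictionary slots to a slot in the merged one.
        let mut remap: Vec<u64> = Vec::with_capacity(batch.values.len());
        for v in &batch.values {
            let found = values.iter().position(|x| x == v);
            let idx = match found {
                Some(i) => i,
                None => {
                    values.push(v.clone());
                    values.len() - 1
                }
            };
            remap.push(idx as u64);
        }
        keys.extend(batch.keys.iter().map(|&k| remap[k as usize]));
    }
    PartitionColumn { values, keys }
}
```

   In this model a per-file batch always needs only the narrowest key width, 
and the width only has to grow when the merged dictionary passes 256 or 
65,536 entries, which is the point where an upcast would be applied.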
   
   If we find that it is too complex to do it downstream, I am not firmly 
opposed to upcasting the type here, but then I agree with @yjshen that u16 
isn't really enough. Also, making it customizable introduces some tuning 
complexity that isn't ideal either.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

