[GitHub] [arrow-datafusion] crepererum opened a new pull request, #5545: refactor: user may choose to dict-encode partition values

via GitHub Fri, 10 Mar 2023 03:27:48 -0800


crepererum opened a new pull request, #5545:
URL: https://github.com/apache/arrow-datafusion/pull/5545


   # Which issue does this PR close?
   \-
   
   # Rationale for this change
   Let the user decide whether to dictionary-encode partition values for 
file-based data sources. Dictionary encoding makes sense for string values but 
is probably pointless or even counterproductive for integer types.
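
A rough sketch of the trade-off, not code from this PR: the functions below estimate the buffer sizes for a partition column that holds one distinct value repeated for every row. The function names are hypothetical, and the estimate assumes `i32` offsets for plain UTF-8 and `u16` dictionary keys while ignoring validity bitmaps and buffer padding.

```rust
/// Plain UTF-8 column: one i32 offset per row (plus one extra) and a
/// separate copy of the value bytes for every row.
fn plain_utf8_bytes(rows: usize, value: &str) -> usize {
    4 * (rows + 1) + rows * value.len()
}

/// Dict(u16, utf8) column with a single distinct value: one u16 key per
/// row, plus a one-entry dictionary (two i32 offsets and the value bytes).
fn dict_u16_utf8_bytes(rows: usize, value: &str) -> usize {
    2 * rows + 2 * 4 + value.len()
}

/// Plain fixed-width column (e.g. an integer type of `width` bytes).
fn plain_fixed_bytes(rows: usize, width: usize) -> usize {
    rows * width
}

/// Dictionary column with u16 keys and a single fixed-width value.
fn dict_u16_fixed_bytes(rows: usize, width: usize) -> usize {
    2 * rows + width
}

fn main() {
    let rows = 1_000_000;
    let date = "2023-03-10"; // example string partition value

    println!("utf8 plain: {} bytes", plain_utf8_bytes(rows, date));
    println!("utf8 dict:  {} bytes", dict_u16_utf8_bytes(rows, date));
    println!("u16 plain:  {} bytes", plain_fixed_bytes(rows, 2));
    println!("u16 dict:   {} bytes", dict_u16_fixed_bytes(rows, 2));
}
```

Under these assumptions, a 10-byte string over one million rows takes roughly 14 MB plain versus roughly 2 MB dictionary-encoded, while a `u16` value takes about 2 MB either way, so the dictionary only adds an indirection.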
   
   # What changes are included in this PR?
   - the data type specified in `partition_columns` is now the actual output type
   - the SQL mapping now follows that, EXCEPT for the default for 
unspecified types, which I kept at `Dict(u16, utf8)`
   - I kept the "buffer reuse" logic that was in there but extended it to more 
dictionary types. I use a type-safe approach here because Arrow buffers may 
become typed at some point with the arrow2 merger
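
The general idea behind typed buffer reuse can be sketched as follows. This is a minimal, standard-library-only illustration and not the code in this PR: `ZeroKeys` and its methods are hypothetical names. It assumes that a partition column has one value per file, so every dictionary key is zero and a single zero-filled key buffer can be shared across batches.

```rust
use std::sync::Arc;

/// Hypothetical cache of zero-valued dictionary keys, typed by key type `K`
/// so that a u8 cache can never be handed out where u16 keys are expected.
struct ZeroKeys<K> {
    buf: Arc<Vec<K>>,
}

impl<K: Default + Clone> ZeroKeys<K> {
    fn new() -> Self {
        Self { buf: Arc::new(Vec::new()) }
    }

    /// Return a shared buffer holding at least `len` zero keys.
    /// A new buffer is allocated only when the cached one is too short;
    /// callers slice the result down to the `len` they need.
    fn keys(&mut self, len: usize) -> Arc<Vec<K>> {
        if self.buf.len() < len {
            self.buf = Arc::new(vec![K::default(); len]);
        }
        Arc::clone(&self.buf)
    }
}

fn main() {
    let mut cache = ZeroKeys::<u16>::new();

    let first = cache.keys(8); // allocates 8 zero keys
    let second = cache.keys(4); // reuses the same allocation

    println!("reused: {}", Arc::ptr_eq(&first, &second));
    println!("keys for a 4-row batch: {:?}", &second[..4]);
}
```

Making the cache generic over `K` is one way to keep the reuse type-safe: each key type gets its own cache, so no untyped byte buffer has to be reinterpreted.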
   
   # Are these changes tested?
   Adjusted existing tests.
   
   # Are there any user-facing changes?
   **BREAKING:** Types for partition columns in file-based sources are no 
longer dictionary-encoded by default. The user MUST choose a dictionary type if 
they want dictionary encoding.


