jorgecarleitao commented on issue #506:
URL: https://github.com/apache/arrow-rs/issues/506#issuecomment-870024847


   Great issue description @alamb 🎩 
   
   I would do it in a separate kernel, so as not to break the principle that 
concatenating arrays is an `O(N)` operation, where `N` is the number of elements 
in all arrays (this one would be `O(N log N)`?).
   
   `ensure_sort: bool` or something like that would be a nice argument for such 
a function.
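
   A minimal sketch of that idea, using a toy `DictArray` type instead of the 
real arrow-rs arrays. The names `DictArray` and `concat_dictionaries` are 
hypothetical, and only the `ensure_sort` flag comes from the suggestion above. 
The merge itself is linear in expected time; the optional sort is the only 
super-linear step.

```rust
use std::collections::HashMap;

/// Toy stand-in for a dictionary-encoded string array (NOT the arrow-rs type).
#[derive(Debug, Clone, PartialEq)]
struct DictArray {
    keys: Vec<usize>,    // each key is an index into `values`
    values: Vec<String>, // the dictionary entries
}

/// Hypothetical kernel: concatenates `arrays` and merges their dictionaries.
/// With `ensure_sort = true` the merged dictionary is also sorted.
/// Panics if any key is out of bounds for its own dictionary.
fn concat_dictionaries(arrays: &[DictArray], ensure_sort: bool) -> DictArray {
    // Pass 1: collect the distinct dictionary values in first-seen order.
    let mut values: Vec<String> = Vec::new();
    let mut index: HashMap<String, usize> = HashMap::new();
    for array in arrays {
        for value in &array.values {
            if !index.contains_key(value) {
                index.insert(value.clone(), values.len());
                values.push(value.clone());
            }
        }
    }

    // Optional step: sort the merged dictionary and rebuild the lookup table.
    if ensure_sort {
        values.sort();
        for (position, value) in values.iter().enumerate() {
            index.insert(value.clone(), position);
        }
    }

    // Pass 2: rewrite every key so that it points into the merged dictionary.
    let mut keys = Vec::new();
    for array in arrays {
        for &key in &array.keys {
            keys.push(index[&array.values[key]]);
        }
    }

    DictArray { keys, values }
}
```

   Keeping the flag on a separate function leaves the plain concatenation 
path untouched, so callers who do not need a sorted dictionary never pay for 
the sort.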
   
   In general, we have a small challenge in how we track dictionary metadata, 
though: our `DataType::Dictionary` does not hold dictionary metadata, which 
means that we must store it somewhere else. This makes things more cumbersome, 
as the function cannot use this information to, e.g., avoid re-sorting an 
already sorted dictionary array unless it is also given that other "dictionary 
metadata".
   
   My feeling is that we should (backward-incompatibly) extend 
`DataType::Dictionary(keys, values, metadata)` where `metadata` is a struct 
containing the different dictionary metadata available in `Field`, but I am not 
100% convinced about this.
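
   To make the shape of that change concrete, here is a simplified sketch. 
The enum below is a cut-down stand-in for the real `DataType`, and the 
`DictionaryMetadata` struct, its `id` and `is_sorted` fields, and the 
`needs_sort` helper are all hypothetical placeholders for whichever dictionary 
metadata `Field` actually carries.

```rust
/// Hypothetical struct bundling the dictionary metadata (field names assumed).
#[derive(Debug, Clone, PartialEq)]
struct DictionaryMetadata {
    id: i64,         // assumed: identifier of the dictionary
    is_sorted: bool, // assumed: whether the dictionary values are sorted
}

/// Cut-down stand-in for `DataType`, with the proposed third component.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    Dictionary(Box<DataType>, Box<DataType>, DictionaryMetadata),
}

/// With the metadata on the type, a kernel can decide from the `DataType`
/// alone whether a sort is still required.
fn needs_sort(data_type: &DataType) -> bool {
    match data_type {
        DataType::Dictionary(_, _, metadata) => !metadata.is_sorted,
        // Non-dictionary types have no dictionary to sort.
        _ => false,
    }
}
```

   The cost is that every existing `match` on `DataType::Dictionary(keys, 
values)` has to be updated, which is why the change would be 
backward-incompatible.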
   
   I also thought about a more radical approach of removing 
`DataType::Dictionary`, since a Dictionary is not formally a DataType, but an 
array encoding. With that said, it does have a different physical 
representation, so in this sense it is convenient to write it as a separate 
`DataType` that can be `matched`. The disadvantage is we can't change an 
array's encoding without changing the logical type associated with it. This 
contrasts with parquet, where encodings and logical types are independent of 
each other.
   
   
   
   



