alamb commented on issue #4729:
URL: https://github.com/apache/arrow-rs/issues/4729#issuecomment-1691482626

   ## Opinion
   
   Putting aside all practical implementation considerations initially, I 
think that removing `DataType::Dictionary` (and `DataType::REE` and 
`DataType::StringView`) from `DataType` is a good idea, as it has the following 
benefits:
   1. Allows the encoding to change from `RecordBatch` to `RecordBatch`, and 
thus can adapt to changing data rather than forcing a single static choice.
   2. Simplifies the type management for downstream systems like DataFusion.
   3. Makes it easy to incrementally add support for new encodings (like REE, 
StringView) in the future without changes to downstream systems.
   
   For example, it likely makes sense for the parquet reader to provide 
dictionary encoded strings (to match what came out of parquet), and then unpack 
this data once it hits some kernel that doesn't support Dictionary encoding, or 
once the data has been filtered down to the point where the dictionary encoding 
overhead outweighs its benefits.
   
   As pointed out above, the major implication is that all the kernels would 
have to "support" Dictionary encoded data. I don't think this is as bad as it 
may initially seem: kernels without support for specific encodings could unpack 
the dictionaries (aka cast to the value type) and proceed. This is my 
understanding of how DuckDB works.
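
The fallback described above can be sketched with a toy model. This uses only the Rust standard library, and every name in it (`StringColumn`, `unpack`, `total_len`, `to_uppercase`) is hypothetical and is not part of the arrow-rs API. It only illustrates the idea that a kernel without a dictionary-aware path can unpack and proceed, while a kernel with one can work on the distinct values directly.

```rust
// Toy model only: these types are hypothetical and are NOT the arrow-rs API.

/// A string column whose physical encoding may differ from batch to batch.
#[derive(Debug, Clone, PartialEq)]
pub enum StringColumn {
    /// One value stored per row.
    Plain(Vec<String>),
    /// Each row stores a key that indexes into a list of distinct values.
    Dictionary { keys: Vec<usize>, values: Vec<String> },
}

impl StringColumn {
    /// Unpack to the plain encoding (the analogue of casting to the value type).
    pub fn unpack(&self) -> Vec<String> {
        match self {
            StringColumn::Plain(v) => v.clone(),
            StringColumn::Dictionary { keys, values } => {
                keys.iter().map(|&k| values[k].clone()).collect()
            }
        }
    }
}

/// A kernel with no dictionary-aware path: it unpacks first, then proceeds.
pub fn total_len(col: &StringColumn) -> usize {
    col.unpack().iter().map(|s| s.len()).sum()
}

/// A kernel with a dictionary-aware path: it transforms each distinct value
/// once and reuses the keys unchanged.
pub fn to_uppercase(col: &StringColumn) -> StringColumn {
    match col {
        StringColumn::Plain(v) => {
            StringColumn::Plain(v.iter().map(|s| s.to_uppercase()).collect())
        }
        StringColumn::Dictionary { keys, values } => StringColumn::Dictionary {
            keys: keys.clone(),
            values: values.iter().map(|s| s.to_uppercase()).collect(),
        },
    }
}
```

In this sketch the caller never has to know which encoding it was handed: `total_len` pays the unpacking cost, `to_uppercase` avoids it, and both accept either variant.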
   
   The primary benefit of the current situation is that it is backwards 
compatible.
   
   ## Steps forward
   So in my mind, if there were some way to get from today to an API where the 
encoding wasn't needed, that would be great. 
   
   I was thinking last night maybe we could somehow use the new 
[`Datum`](https://docs.rs/arrow-array/45.0.0/arrow_array/trait.Datum.html) API 
to achieve this. Single values are already some special encoding of arrays. 
Adding some way to encode dictionary information there too might fit.
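
To make the idea a little more concrete, here is a rough sketch of an argument type that treats "single value" and "dictionary" as two encodings of the same logical column. It is a standard-library-only toy: `Int64Datum`, `Encoding`, `value_at`, and `add` are hypothetical names and do not reflect how the real `Datum` trait is defined or how it would need to change.

```rust
// Hypothetical sketch; this is NOT the arrow-rs `Datum` trait.

/// How the values behind a datum are physically laid out.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Encoding {
    Plain,
    Scalar,
    Dictionary,
}

/// A kernel argument whose encoding is carried alongside the data
/// rather than in the logical type (which is "64-bit integer" in every case).
#[derive(Debug, Clone, PartialEq)]
pub enum Int64Datum {
    /// One value stored per row.
    Plain(Vec<i64>),
    /// A single value, logically repeated for every row.
    Scalar(i64),
    /// Per-row keys that index into a list of distinct values.
    Dictionary { keys: Vec<usize>, values: Vec<i64> },
}

impl Int64Datum {
    /// Report the physical encoding so a kernel can pick a specialized path.
    pub fn encoding(&self) -> Encoding {
        match self {
            Int64Datum::Plain(_) => Encoding::Plain,
            Int64Datum::Scalar(_) => Encoding::Scalar,
            Int64Datum::Dictionary { .. } => Encoding::Dictionary,
        }
    }

    /// Logical value at `row`, regardless of encoding.
    pub fn value_at(&self, row: usize) -> i64 {
        match self {
            Int64Datum::Plain(v) => v[row],
            Int64Datum::Scalar(s) => *s,
            Int64Datum::Dictionary { keys, values } => values[keys[row]],
        }
    }
}

/// A kernel written against logical values only; it accepts any mix of encodings.
pub fn add(lhs: &Int64Datum, rhs: &Int64Datum, num_rows: usize) -> Vec<i64> {
    (0..num_rows)
        .map(|i| lhs.value_at(i) + rhs.value_at(i))
        .collect()
}
```

A real design would also need to decide what encoding the output uses and how nulls are represented, both of which this sketch leaves out.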
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
