alamb commented on issue #4729: URL: https://github.com/apache/arrow-rs/issues/4729#issuecomment-1691482626
## Opinion

Putting aside all practical implementation considerations initially, I think that removing `DataType::Dictionary` (and `DataType::REE` and `DataType::StringView`) from `DataType` is a good idea as it has the following benefits:

1. Allows the encoding to change from `RecordBatch` to `RecordBatch`, and thus can adapt to changing data rather than forcing a single static choice.
2. Simplifies the type management for downstream systems like DataFusion.
3. Makes it easy to incrementally add support for new encodings (like REE, StringView) in the future without changes to downstream systems.

For example, it likely makes sense for the parquet reader to provide dictionary encoded strings (to match what came out of parquet), and then unpack this data once it hits some kernel that doesn't support Dictionary encoding, or once the data has been filtered down to the point where the dictionary encoding overhead outweighs its benefits.

As pointed out above, the major implication is that all the kernels would have to "support" Dictionary encoded data. I don't think this is as bad as it may initially seem: kernels without support for specific encodings could unpack the dictionaries (aka cast to the value type) and proceed. This is my understanding of how DuckDB works.

The primary benefit of the current situation is that it is backwards compatible.

## Steps forward

So in my mind, if there was some way to get from today to an API where the encoding wasn't needed, that would be great.

I was thinking last night maybe we could somehow use the new [`Datum`](https://docs.rs/arrow-array/45.0.0/arrow_array/trait.Datum.html) API to achieve this. Single values are already some special encoding of arrays. Adding some way to encode dictionary information there too might fit.
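To make the "unpack and proceed" fallback concrete, here is a minimal sketch in plain Rust. It does not use the arrow-rs API: `StringColumn`, `unpack`, `uppercase`, and `total_len` are hypothetical names invented for illustration. The assumption is a single logical string type whose physical encoding can vary, with one kernel that is dictionary-aware and one that is not.

```rust
/// Hypothetical physical encodings of one logical string column.
/// Illustrative only; these types are not part of arrow-rs.
#[derive(Debug, Clone, PartialEq)]
enum StringColumn {
    /// One value stored per row.
    Plain(Vec<String>),
    /// Each row is a key that indexes into `values`.
    Dictionary { keys: Vec<usize>, values: Vec<String> },
}

impl StringColumn {
    /// Unpack any encoding into plain values (the "cast to the value type" step).
    fn unpack(&self) -> Vec<String> {
        match self {
            StringColumn::Plain(v) => v.clone(),
            StringColumn::Dictionary { keys, values } => {
                keys.iter().map(|&k| values[k].clone()).collect()
            }
        }
    }
}

/// A dictionary-aware kernel: transform each distinct value once and reuse the keys.
fn uppercase(col: &StringColumn) -> StringColumn {
    match col {
        StringColumn::Plain(v) => {
            StringColumn::Plain(v.iter().map(|s| s.to_uppercase()).collect())
        }
        StringColumn::Dictionary { keys, values } => StringColumn::Dictionary {
            keys: keys.clone(),
            values: values.iter().map(|s| s.to_uppercase()).collect(),
        },
    }
}

/// A kernel with no dictionary support: it unpacks first, then proceeds.
fn total_len(col: &StringColumn) -> usize {
    col.unpack().iter().map(|s| s.len()).sum()
}

fn main() {
    let dict = StringColumn::Dictionary {
        keys: vec![0, 1, 0, 0],
        values: vec!["a".to_string(), "bc".to_string()],
    };
    let plain = StringColumn::Plain(dict.unpack());

    // Both encodings carry the same logical data, so both kernels agree.
    assert_eq!(total_len(&dict), total_len(&plain));
    assert_eq!(uppercase(&dict).unpack(), uppercase(&plain).unpack());
    println!("total length: {}", total_len(&dict));
}
```

In this sketch the caller never matches on the encoding. Only kernels that benefit from a fast path do, and everything else pays a one-time unpacking cost.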
