felipecrv commented on issue #40128: URL: https://github.com/apache/arrow/issues/40128#issuecomment-1954354911
My take here is that dictionaries are for compression and might be aggressively further-compressed by some operations [1], but `cast` is special and *should preserve* the original dictionary because the goal of a cast is to change only the encoding of the data with as little as possible impact on its semantic properties. I would keep the fast path that and make only cast kernels the exception. I don't know exactly how without looking at the code. My reasoning: we don't want code accidentally relying on a guarantee that is stronger than "the dictionary contains the values that are referenced at least once". The path to full determinism here is probably being more aggressive in removing values, than preserving them. [1] many operations act on the `array[0..length)` values and might gather only the dictionary values that have references to them -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
