sunchao commented on issue #4729: URL: https://github.com/apache/arrow-rs/issues/4729#issuecomment-1690570631
Thanks @tustvold for the quick reply! I'm mainly throwing out ideas for discussion here, since we've been struggling with dictionary types for a while now in our internal integration with Arrow & DF. > It would add a huge amount of additional API complexity, as the arrays would go from having a well defined layout to having potentially different layouts - functions like PrimitiveArray::values, PrimitiveArray::new would need to change, etc... Hmm why it would add API complexity? I'm thinking `PrimitiveArray::values` still share the same impl if it is non-dictionary, but returns unpacked values if it is? `PrimitiveArray::new` doesn't have to change, and we can have a separate constructor for the dictionary case (such as `PrimitiveArray::new_dict`). > It would add a branch on value access, which will break vectorisation of most kernels I'm hoping the compiler is smart enough to optimize this away. I briefly tried [some similar ideas](https://rust.godbolt.org/z/7jW57M81W) here in Godbolt and it looks like LLVM can indeed do that. I also ran some benchmarks with basic `add_scalar` and `add_dyn` and didn't see any regression AFAICT. > You would end up with duplicated dictionary logic spread across multiple array types Yes this is true, but I think it can be avoided though perhaps with some trait level default methods. > It is unclear how one would type the dictionary keys TBH I don't have good ideas on this without breaking type safety. One thing we can do is to follow [Arrow C++](https://github.com/apache/arrow/blob/main/cpp/src/arrow/array/array_dict.cc#L55) and let compiler to optimize away the branching. > This model does not seem to be followed by any arrow implementations AFAICT? Right, but I don't think this can be used as a strong reason to _not_ doing it though. The idea is partially inspired by Velox which is compatible with Arrow. > I'm unclear on the benefits of dictionary encoding primitives, in most cases the representation will be larger and slower to process. A Dictionary<Int32, Int32Array> can only ever be larger than the corresponding Int32Array Yes in certain cases it will be more beneficial to unpack primitive dictionary arrays first, as you've shown in https://github.com/apache/arrow-rs/pull/4407 :). This is orthogonal to this issue though, since the caller can decide whether to encode dictionary primitives or not. I think primitive dictionaries may still more beneficial in some other cases (e.g., filter) > IMO DataType IS the physical type, it provides separate physical representations for the same logical types like decimals, intervals, floats, strings, etc... Query engines can then typically implement a logical type system on top of this That is true. In this specific case I'm talking about DataFusion though, which uses DataType as logical type I think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
