sunchao commented on issue #4729:
URL: https://github.com/apache/arrow-rs/issues/4729#issuecomment-1690570631

   Thanks @tustvold for the quick reply! I'm mainly throwing out ideas for 
discussion here, since we've been struggling with dictionary types for a while 
now in our internal integration with Arrow & DF.
   
   > It would add a huge amount of additional API complexity, as the arrays 
would go from having a well defined layout to having potentially different 
layouts - functions like PrimitiveArray::values, PrimitiveArray::new would need 
to change, etc...
   
   Hmm why it would add API complexity? I'm thinking `PrimitiveArray::values` 
still share the same impl if it is non-dictionary, but returns unpacked values 
if it is? `PrimitiveArray::new` doesn't have to change, and we can have a 
separate constructor for the dictionary case (such as 
`PrimitiveArray::new_dict`).
   
   > It would add a branch on value access, which will break vectorisation of 
most kernels
   
   I'm hoping the compiler is smart enough to optimize this away. I briefly 
tried [some similar ideas](https://rust.godbolt.org/z/7jW57M81W) here in 
Godbolt and it looks like LLVM can indeed do that. I also ran some benchmarks 
with basic `add_scalar` and `add_dyn` and didn't see any regression AFAICT.
   
   > You would end up with duplicated dictionary logic spread across multiple 
array types
   
   Yes this is true, but I think it can be avoided though perhaps with some 
trait level default methods.
   
   > It is unclear how one would type the dictionary keys
   
   TBH I don't have good ideas on this without breaking type safety. One thing 
we can do is to follow [Arrow 
C++](https://github.com/apache/arrow/blob/main/cpp/src/arrow/array/array_dict.cc#L55)
 and let compiler to optimize away the branching.
   
   > This model does not seem to be followed by any arrow implementations 
AFAICT?
   
   Right, but I don't think this can be used as a strong reason to _not_ doing 
it though. The idea is partially inspired by Velox which is compatible with 
Arrow.
   
   > I'm unclear on the benefits of dictionary encoding primitives, in most 
cases the representation will be larger and slower to process. A 
Dictionary<Int32, Int32Array> can only ever be larger than the corresponding 
Int32Array
   
   Yes in certain cases it will be more beneficial to unpack primitive 
dictionary arrays first, as you've shown in 
https://github.com/apache/arrow-rs/pull/4407 :). This is orthogonal to this 
issue though, since the caller can decide whether to encode dictionary 
primitives or not. I think primitive dictionaries may still more beneficial in 
some other cases (e.g., filter)
   
   > IMO DataType IS the physical type, it provides separate physical 
representations for the same logical types like decimals, intervals, floats, 
strings, etc... Query engines can then typically implement a logical type 
system on top of this
   
   That is true. In this specific case I'm talking about DataFusion though, 
which uses DataType as logical type I think?
   
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to