sunchao commented on issue #4729:
URL: https://github.com/apache/arrow-rs/issues/4729#issuecomment-1690716072

   > TLDR I would strongly resist changes to this design without extremely 
compelling empirical data to support it.
   
   Understood. Like I said, I mainly want to invite a discussion on this topic 
and see if there is anything we can do to improve dictionary support in Arrow :)
   
   > Have you tried just not using dictionaries? I'm being somewhat glib, but 
do you have empirical data to back up their usage?
   
   To better integrate with DF, we are about to try early unpacking of dictionary arrays whose value type is primitive, and see how it goes. We still plan to keep dictionary arrays with string and binary value types as they are, though. I'm surprised to hear that even string dictionaries do not perform well in your case; I can collect more data points on this.
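
   To be concrete, by "early unpacking" I just mean gathering the values through the keys into a flat primitive array before kernels run. A minimal sketch in plain Rust (not the actual arrow-rs types or kernels, and ignoring nulls):

```rust
/// Toy sketch of "early unpacking": flatten a primitive dictionary array
/// (keys + values) into a plain values buffer before handing it to kernels.
/// Nulls and the real arrow-rs types are omitted for brevity.
fn unpack_dictionary(keys: &[u32], values: &[i64]) -> Vec<i64> {
    keys.iter().map(|&k| values[k as usize]).collect()
}

fn main() {
    let values = vec![10_i64, 20, 30];  // low-cardinality dictionary values
    let keys = vec![0_u32, 2, 2, 1, 0]; // one key per row
    assert_eq!(unpack_dictionary(&keys, &values), vec![10, 30, 30, 20, 10]);
}
```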
   
   > On a more holistic level, pushing dictionaries into PrimitiveArray and 
friends would implicitly force all kernels to support dictionaries. This would 
either need to be custom logic, something we have found to be extremely painful 
to maintain and result in huge amounts of codegen, or would silently 
materialize the dictionary, potentially blowing the memory budget and having 
poor performance.
   >
   > The current approach where type coercion machinery explicitly materializes 
dictionaries when necessary, with this explicitly called out in the plan and 
optimized to be only once, seems a better approach imo...
   
   Assuming we want to treat dictionary arrays as first-class citizens: in the current model, some kernels support them while others don't, which causes the inconsistencies we have often seen in the past.
   
   Like you said, the new model forces all kernels to support dictionaries, but wouldn't that be better than living with the inconsistencies? It doesn't necessarily mean more complexity either, since some kernels, like filter, can operate directly on a unified API without needing to be aware of the underlying encoding.
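
   For illustration, here is a toy sketch of what I mean by a unified API (deliberately not the real arrow-rs `ArrayAccessor` machinery): the filter kernel is written against a per-row accessor, so the dictionary lookup is hidden behind it and the kernel never branches on the encoding.

```rust
/// Toy model of a unified per-row accessor that hides the encoding.
trait RowAccessor {
    fn value(&self, row: usize) -> i64;
    fn len(&self) -> usize;
}

struct Flat(Vec<i64>);
struct Dict { keys: Vec<u32>, values: Vec<i64> }

impl RowAccessor for Flat {
    fn value(&self, row: usize) -> i64 { self.0[row] }
    fn len(&self) -> usize { self.0.len() }
}

impl RowAccessor for Dict {
    // The dictionary lookup lives behind the accessor, not in the kernel.
    fn value(&self, row: usize) -> i64 { self.values[self.keys[row] as usize] }
    fn len(&self) -> usize { self.keys.len() }
}

/// A filter-style kernel that selects rows by a boolean mask. It never
/// needs to know whether the input is flat or dictionary encoded.
fn filter<A: RowAccessor>(input: &A, mask: &[bool]) -> Vec<i64> {
    (0..input.len()).filter(|&i| mask[i]).map(|i| input.value(i)).collect()
}

fn main() {
    let mask = vec![true, false, true, true];
    let flat = Flat(vec![10, 20, 30, 30]);
    let dict = Dict { keys: vec![0, 1, 2, 2], values: vec![10, 20, 30] };
    // The same kernel works unchanged for both encodings.
    assert_eq!(filter(&flat, &mask), filter(&dict, &mask));
}
```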
   
   > Why would filtering a primitive dictionary be faster than filtering 
primitives, they're the same thing?
   
   Because the filter predicate only needs to be evaluated on the dictionary values, whose cardinality could be much lower, especially if the predicate is an expensive one like a UDF.
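
   Roughly what I have in mind, as a toy sketch (assuming a selection-style kernel and an expensive per-value predicate; not the actual arrow-rs API):

```rust
/// Toy sketch: evaluate an expensive predicate once per distinct dictionary
/// value, then build the per-row selection mask by indexing through the keys.
fn filter_mask_via_dictionary(
    keys: &[u32],
    values: &[String],
    expensive_pred: impl Fn(&str) -> bool,
) -> Vec<bool> {
    // O(cardinality) predicate evaluations...
    let hits: Vec<bool> = values.iter().map(|v| expensive_pred(v.as_str())).collect();
    // ...followed by O(rows) cheap lookups.
    keys.iter().map(|&k| hits[k as usize]).collect()
}

fn main() {
    let values = vec!["foo".to_string(), "bar".to_string()];
    let keys = vec![0_u32, 1, 1, 0];
    // Pretend this closure is an expensive UDF: it runs twice, not four times.
    let mask = filter_mask_via_dictionary(&keys, &values, |s| s.starts_with('b'));
    assert_eq!(mask, vec![false, true, true, false]);
}
```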
   
   > You need to specify -C target-cpu=native -C opt-level=3 in order to get 
LLVM to vectorize anything at all. You also need to wrap the vector in 
std::hint::black_box to stop LLVM using compile time information it wouldn't 
have in practice. This then showcases what I'm referring to, it generates very 
suboptimal code for loops containing branches.
   
   Hmm, I tried that, except that I used `target-cpu=haswell` instead of `target-cpu=native`. The result is here: https://rust.godbolt.org/z/Mbq7Yaeh5. I can still see SIMD being emitted; note the jump chain `.LBB25_7` -> `.LBB25_10` -> `.LBB25_15` -> `.LBB25_16`.
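
   For reference, this is the shape of the experiment (not the exact snippet in the godbolt link): a loop with a per-element branch, with the input wrapped in `std::hint::black_box` and compiled with `-C opt-level=3 -C target-cpu=haswell`.

```rust
use std::hint::black_box;

// A loop with a per-element branch, the pattern under discussion. Compile with
// `rustc -C opt-level=3 -C target-cpu=haswell` and inspect the assembly;
// `black_box` stops LLVM from exploiting compile-time knowledge of the input.
pub fn sum_if_positive(values: &[i32]) -> i64 {
    let values = black_box(values);
    let mut sum = 0i64;
    for &v in values {
        if v > 0 {
            sum += v as i64;
        }
    }
    sum
}

fn main() {
    let data: Vec<i32> = (-1000..1000).collect();
    println!("{}", sum_if_positive(&data));
}
```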
   
   > I would be very surprised if that is getting optimised correctly, I also 
can't seem to find any uses of it within the C++ kernels - 
https://github.com/search?q=repo%3Aapache%2Farrow%20GetValueIndex&type=code.
   
   Yeah, it would be too bad if it isn't optimized 😂. This function is used in `ScalarFromArraySlotImpl::Visit(const DictionaryArray& a)`, which in turn is called by `Array::GetScalar`, which seems like a pretty fundamental method.
   
   > Edit: you may notice I have a bit of a chip on my shoulder when it comes 
to dictionaries, for which I apologise, I've just found them to be a perennial 
source of frustration...
   
   Totally understand. We too share the frustration on this topic. 
   

