sunchao commented on issue #4729: URL: https://github.com/apache/arrow-rs/issues/4729#issuecomment-1690716072
> TLDR I would strongly resist changes to this design without extremely compelling empirical data to support it.

Understood. Like I said, I mainly want to invite a discussion on this topic and see if there is anything we can do to improve dictionary support in Arrow :)

> Have you tried just not using dictionaries? I'm being somewhat glib, but do you have empirical data to back up their usage?

To better integrate with DF, we are about to try early unpacking of dictionary arrays whose value type is a primitive type and see how it goes. We still plan to keep dictionary arrays with string & binary value types as they are, though. I'm surprised to hear that even string dictionaries do not perform well in your case. I can collect more data points on this.

> On a more holistic level, pushing dictionaries into PrimitiveArray and friends would implicitly force all kernels to support dictionaries. This would either need to be custom logic, something we have found to be extremely painful to maintain and result in huge amounts of codegen, or would silently materialize the dictionary, potentially blowing the memory budget and having poor performance.
>
> The current approach where type coercion machinery explicitly materializes dictionaries when necessary, with this explicitly called out in the plan and optimized to be only once, seems a better approach imo...

Assuming we want to treat dictionary arrays as first-class citizens: in the current model, some kernels support them while others don't, which causes the inconsistencies we've often seen in the past. Like you said, the new model forces all kernels to support dictionaries, but wouldn't that be better than having the inconsistencies? It may not necessarily mean more complexity, since some kernels, like filter, can operate directly on the unified API without needing to be aware of the underlying encoding.

> Why would filtering a primitive dictionary be faster than filtering primitives, they're the same thing?

Because the filter predicate only needs to be evaluated on the dictionary values, whose cardinality could be much lower, especially if the predicate is a complex one like a UDF.

> You need to specify -C target-cpu=native -C opt-level=3 in order to get LLVM to vectorize anything at all. You also need to wrap the vector in std::hint::black_box to stop LLVM using compile time information it wouldn't have in practice. This then showcases what I'm referring to, it generates very suboptimal code for loops containing branches.

Hmm, I tried it, except instead of `target-cpu=native` I used `target-cpu=haswell`. The result is here: https://rust.godbolt.org/z/Mbq7Yaeh5. I can still see SIMD getting triggered. Notice the jump chain: `.LBB25_7` -> `.LBB25_10` -> `.LBB25_15` -> `.LBB25_16`.

> I would be very surprised if that is getting optimised correctly, I also can't seem to find any uses of it within the C++ kernels - https://github.com/search?q=repo%3Aapache%2Farrow%20GetValueIndex&type=code.

Yeah, it would be too bad if it isn't optimized 😂. This function is used in `ScalarFromArraySlotImpl::Visit(const DictionaryArray& a)`, which in turn is called by `Array::GetScalar`, which seems like a pretty fundamental method.

> Edit: you may notice I have a bit of a chip on my shoulder when it comes to dictionaries, for which I apologise, I've just found them to be a perennial source of frustration...

Totally understand. We too share the frustration on this topic.
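To make the filter point concrete, here is a minimal sketch against the arrow-rs API (the names `expensive_predicate` and `filter_mask_via_dictionary` are illustrative, not part of the library) of evaluating a predicate once per distinct dictionary value and broadcasting the result through the keys, rather than evaluating it once per row:

```rust
use arrow::array::{Array, BooleanArray, DictionaryArray, StringArray};
use arrow::datatypes::Int32Type;

/// Stand-in for an expensive row-level predicate (e.g. a UDF or regex match).
fn expensive_predicate(v: &str) -> bool {
    v.contains("foo")
}

/// Evaluate the predicate once per distinct dictionary value, then broadcast
/// the result to every row through the key array.
fn filter_mask_via_dictionary(dict: &DictionaryArray<Int32Type>) -> BooleanArray {
    let values = dict
        .values()
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("string dictionary values");

    // The predicate runs `values.len()` times (the dictionary cardinality),
    // not `dict.len()` times (the row count).
    let value_mask: Vec<bool> = values
        .iter()
        .map(|v| v.map_or(false, expensive_predicate))
        .collect();

    // Null keys stay null in the output mask.
    dict.keys()
        .iter()
        .map(|k| k.map(|k| value_mask[k as usize]))
        .collect()
}

fn main() {
    // Low-cardinality string dictionary: 2 distinct values, many rows.
    let dict: DictionaryArray<Int32Type> =
        vec!["foo_bar", "baz", "foo_bar", "foo_bar", "baz"]
            .into_iter()
            .collect();

    let mask = filter_mask_via_dictionary(&dict);
    assert_eq!(mask.len(), dict.len());
    // `mask` could now be fed to arrow's `filter` kernel to select matching rows.
}
```

For a UDF-style predicate this means O(cardinality) predicate evaluations instead of O(rows), which is where the potential win comes from.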
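The snippet behind the godbolt link isn't reproduced in the thread; a hypothetical stand-alone loop of the general shape being discussed (gather through dictionary keys with a branch on validity, inputs wrapped in `std::hint::black_box`, compiled with `-C opt-level=3 -C target-cpu=haswell`) might look like this:

```rust
/// Hypothetical stand-in for the kind of loop under discussion: gather through
/// dictionary keys with a branch on validity.
/// Compile with e.g.: rustc -C opt-level=3 -C target-cpu=haswell --emit=asm main.rs
pub fn sum_dict(keys: &[i32], valid: &[bool], values: &[i64]) -> i64 {
    // black_box hides compile-time knowledge of the inputs from LLVM,
    // as suggested above.
    let keys = std::hint::black_box(keys);
    let valid = std::hint::black_box(valid);
    let values = std::hint::black_box(values);

    let mut sum = 0i64;
    for (k, v) in keys.iter().zip(valid.iter()) {
        // Whether this branch gets if-converted and vectorized depends on the
        // target features and LLVM's cost model.
        if *v {
            sum += values[*k as usize];
        }
    }
    sum
}

fn main() {
    let keys = vec![0, 1, 0, 1, 0];
    let valid = vec![true, false, true, true, false];
    let values = vec![10i64, 20];
    println!("{}", sum_dict(&keys, &valid, &values));
}
```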
