rtpsw commented on code in PR #13880:
URL: https://github.com/apache/arrow/pull/13880#discussion_r958707960
##########
cpp/src/arrow/compute/exec/asof_join_node.cc:
##########
@@ -294,10 +452,22 @@ class InputState {
// Index of the time col
col_index_t time_col_index_;
// Index of the key col
- col_index_t key_col_index_;
+ vec_col_index_t key_col_index_;
+ // Type id of the time column
+ Type::type time_type_id_;
+ // Type id of the key column
+ std::vector<Type::type> key_type_id_;
+ // Hasher for key elements
+ mutable KeyHasher* key_hasher_;
+ // True if hashing is mandatory
+ bool must_hash_;
+ // True if null by-key values are expected
+ bool nullable_by_key_;
Review Comment:
I'll look into this, then.
>ideally the engine knows if a column is nullable and utilize the fast path
without user specifying any flags.
Unfortunately, the engine doesn't know this. In fact, whether a column is
nullable cannot be known upfront: even after an input node has generated a
million exec-batches in which the column is non-null, the next exec-batch could
contain nulls in this column. The only way to deal with this is dynamically, by
rehashing the memo-store, which means replacing the fast-path keys with
slow-path keys for all entries. The cost of this replacement is incurred at most
once in the life of an `AsofJoinNode`, because there is no switch back to the
fast path (though a different algorithm could allow one), so it's not too bad
from a performance perspective; it's just more complex logic around hashing.
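
To make the idea concrete, here is a rough sketch of the one-time switch. It is not the code in this PR: `MemoStoreSketch`, `SlowKey`, `SwitchToSlowPath` and the other names are hypothetical, a single `int64` by-key column is assumed, and `std::nullopt` stands in for a null by-key value. The fast path keys entries by the raw value; the first null triggers a single pass that re-keys every entry with a validity-aware key.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <unordered_map>

// Hypothetical stand-in for a memo store; none of these names come from Arrow.
class MemoStoreSketch {
 public:
  // Remember the latest row seen for a by-key value (nullopt models a null).
  void Store(std::optional<std::int64_t> by_key, std::int64_t row) {
    // The first null forces the one-time switch to slow-path keys.
    if (!by_key.has_value() && !slow_path_) SwitchToSlowPath();
    if (slow_path_) {
      slow_[ToSlowKey(by_key)] = row;
    } else {
      fast_[*by_key] = row;
    }
  }

  // Look up the latest row stored for a by-key value, if any.
  std::optional<std::int64_t> Find(std::optional<std::int64_t> by_key) const {
    if (slow_path_) {
      auto it = slow_.find(ToSlowKey(by_key));
      if (it == slow_.end()) return std::nullopt;
      return it->second;
    }
    if (!by_key.has_value()) return std::nullopt;  // no null was stored yet
    auto it = fast_.find(*by_key);
    if (it == fast_.end()) return std::nullopt;
    return it->second;
  }

  bool slow_path() const { return slow_path_; }
  int switch_count() const { return switch_count_; }

 private:
  // Slow-path key: carries validity alongside the value.
  struct SlowKey {
    bool valid;
    std::int64_t value;
    bool operator==(const SlowKey& other) const {
      return valid == other.valid && value == other.value;
    }
  };
  struct SlowKeyHash {
    std::size_t operator()(const SlowKey& key) const {
      return std::hash<std::int64_t>{}(key.value) * 31u + (key.valid ? 1u : 0u);
    }
  };

  static SlowKey ToSlowKey(std::optional<std::int64_t> by_key) {
    return by_key.has_value() ? SlowKey{true, *by_key} : SlowKey{false, 0};
  }

  // Re-key every existing entry; there is no path back to the fast keys.
  void SwitchToSlowPath() {
    for (const auto& [value, row] : fast_) slow_[SlowKey{true, value}] = row;
    fast_.clear();
    slow_path_ = true;
    ++switch_count_;
  }

  std::unordered_map<std::int64_t, std::int64_t> fast_;
  std::unordered_map<SlowKey, std::int64_t, SlowKeyHash> slow_;
  bool slow_path_ = false;
  int switch_count_ = 0;
};
```

In this sketch the re-keying pass is linear in the number of stored entries and runs at most once per store, which is the "at most once" cost described above. The real node would also have to handle multiple key columns and non-integer key types.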