michalursa edited a comment on pull request #10845:
URL: https://github.com/apache/arrow/pull/10845#issuecomment-890166609


   1. The comment about UINT32_MAX probably needs to be updated
   
   2. It seems to me that ConsumeCachedProbeBatches is only called for a single 
thread index - the one for the thread that reaches completion of build_counter_.
   
   3. Because in StopProducing calls to Cancel() on two AtomicCounters are 
connected with ||, finished_.MarkFinished() can be called twice (first thread 
gets true from first counter Cancel() call and some other thread from the 
second Cancel() call). Also, shouldn't we always Cancel() both counters?
   
   4. I wonder what happens with an empty input on build side?
   
   5. I think we will only have one class for hash join. It should be fine to 
call it HashJoinNode and throw Status::NotImplemented() for join types outside 
of semi, anti-semi. JoinType enum could have elements from all other types as 
well. Also JoinType enum would be better as "enum class" although Arrow C++ 
probably has some policy about enums.
   
   6. I would rename build_side_complete_ to hash_table_built_ or 
hash_table_build_complete_. Currently I get it confused with build_counter_ 
checks, where one means all build side input batches consumed by local state 
hash tables, and the other means hash table merge is complete.
   
   7. Also it would be nice to tie these two conditions above to futures, so 
that a merge task and tasks to process cache probe side batches could be 
generated and scheduled to execute once these futures are complete. But the 
futures are not critical at this point, just something nice to have.
   
   8. Status returned from CacheProbeBatch is always OK()
   
   9. We probably don't support DictionaryArray in key columns in the code as 
it is right now, we should check and return Status::NotImplemented() when 
making hash join node (or make sure it works). Also there could be a scenario 
where one side of the join uses DictionaryArray while the other uses Array with 
the same underlying type for keys to compare.
   
   10. In BuildSideMerge() ARROW_DCHECK(state->grouper). Perhaps it is a 
copy-paste from group by node, but it would be good to have a comment about why 
it is not possible to have states 0 and 2 initialized but not 1. This is not 
obvious. And maybe it should just be relaxed to skip processing if the local 
thread state with a given index is not initialized.
   
   11. TotalReached() method added to AtomicCounter is not used anywhere.
   
   12. There is a problem with null key. I believe in hash join with equality 
condition it should be that "null != null" (and there is usually a separate 
comparison operator that treats nulls as equal), while in group by "null==null" 
when matching groups. We should have a comment about it and document it for the 
users (maybe we don't have documentation strings for exec nodes yet). If needed 
we would have to filter out null keys separately from Grouper.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to