nirandaperera commented on pull request #10845:
URL: https://github.com/apache/arrow/pull/10845#issuecomment-891068825


   >     2. It seems to me that ConsumeCachedProbeBatches is only called for a 
single thread index - the one for the thread that reaches completion of 
build_counter_.
   > 
   Yes, thanks @michalursa. I missed this! 
   
   >     3. Because in StopProducing calls to Cancel() on two AtomicCounters 
are connected with ||, finished_.MarkFinished() can be called twice (first 
thread gets true from first counter Cancel() call and some other thread from 
the second Cancel() call). Also, shouldn't we always Cancel() both counters?
   > 
   I see... The `GroupByNode` had this,
   ```c++
       if (input_counter_.Cancel()) {
         finished_.MarkFinished();
       } else if (output_counter_.Cancel()) {
         finished_.MarkFinished();
       }
   ``` 
   and I was wondering why both cases had the same code path. I thought it could be combined into a single statement.
   
   So, do you mean that `finished_.MarkFinished()` should only be called when both `build_counter_.Cancel()` and `out_counter_.Cancel()` return true?
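   Just to make sure I understand, here is a minimal, self-contained sketch of that (using a simplified stand-in for `AtomicCounter`, not Arrow's actual class; the `finished_` member is elided):
   ```c++
   #include <atomic>

   // Simplified stand-in for AtomicCounter: Cancel() returns true only for the
   // single caller that actually performs the cancellation.
   struct AtomicCounter {
     bool Cancel() { return !cancelled_.exchange(true); }
     std::atomic<bool> cancelled_{false};
   };

   AtomicCounter build_counter_, out_counter_;

   void StopProducing() {
     // Always cancel both counters (no short-circuiting), and only mark the
     // node finished when this thread observed both cancellations.
     bool cancelled_build = build_counter_.Cancel();
     bool cancelled_out = out_counter_.Cancel();
     if (cancelled_build && cancelled_out) {
       // finished_.MarkFinished();  // would be called here, at most once
     }
   }
   ```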
   
   >     4. I wonder what happens with an empty input on build side?
   > 
   This was my thought process: `build_counter_` initially has -1 as its total. So, until the `build_input` signals `InputFinished` with 0, probe batches will be cached. And when it receives 0, it toggles `build_side_complete_`, and the cached probe batches will be probed against an empty hash table. 
   We could actually return a NullArray from the `Grouper::Find` method early, if the hash table is empty. WDYT?
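   For example, something along these lines (a hedged sketch, not the actual `Grouper::Find` signature; `FindMatches` and its parameters are made up for illustration):
   ```c++
   #include <arrow/api.h>

   // Hypothetical helper: if the build side produced no rows, every probe row
   // misses, so we can emit an all-null index array without touching the hash table.
   arrow::Result<std::shared_ptr<arrow::Array>> FindMatches(int64_t num_build_rows,
                                                            int64_t num_probe_rows) {
     if (num_build_rows == 0) {
       // Empty hash table: return nulls for every probe row.
       return arrow::MakeArrayOfNull(arrow::uint32(), num_probe_rows);
     }
     // ... otherwise probe the hash table for each key (omitted in this sketch) ...
     return arrow::Status::NotImplemented("hash-table probe not shown here");
   }
   ```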
   
   >     5. I think we will only have one class for hash join. It should be 
fine to call it HashJoinNode and throw Status::NotImplemented() for join types 
outside of semi, anti-semi. JoinType enum could have elements from all other 
types as well. Also JoinType enum would be better as "enum class" although 
Arrow C++ probably has some policy about enums.
   > 
   Sure! 
   
   >     6. I would rename build_side_complete_ to hash_table_built_ or 
hash_table_build_complete_. Currently I get it confused with build_counter_ 
checks, where one means all build side input batches consumed by local state 
hash tables, and the other means hash table merge is complete.
   > 
   Sure!
   
   >     7. Also it would be nice to tie these two conditions above to futures, 
so that a merge task and tasks to process cache probe side batches could be 
generated and scheduled to execute once these futures are complete. But the 
futures are not critical at this point, just something nice to have.
   > 
   I will think about this one! :-)
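   As a starting point, here is a generic sketch of the idea using plain `std::future` (Arrow's own `Future` type would be the natural choice inside the exec node; this is just a standalone illustration):
   ```c++
   #include <future>
   #include <iostream>

   int main() {
     // Future that completes once the hash table has been built/merged.
     std::promise<void> hash_table_built;
     std::shared_future<void> built_fut = hash_table_built.get_future().share();

     // Task to process the cached probe-side batches, scheduled now but gated
     // on the build side being complete.
     auto probe_task = std::async(std::launch::async, [built_fut] {
       built_fut.wait();  // runs only after the build side signals completion
       std::cout << "probing cached batches against the merged hash table\n";
     });

     // ... build-side input consumption and merge would happen here ...
     hash_table_built.set_value();  // analogous to marking the build future finished
     probe_task.get();
     return 0;
   }
   ```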
   
   >     8. Status returned from CacheProbeBatch is always OK()
   > 
   I'll make this void!
   
   >     9. We probably don't support DictionaryArray in key columns in the 
code as it is right now, we should check and return Status::NotImplemented() 
when making hash join node (or make sure it works). Also there could be a 
scenario where one side of the join uses DictionaryArray while the other uses 
Array with the same underlying type for keys to compare.
   > 
   Sure!
   
   >     10. In BuildSideMerge() ARROW_DCHECK(state->grouper). Perhaps it is a 
copy-paste from group by node, but it would be good to have a comment about why 
it is not possible to have states 0 and 2 initialized but not 1. This is not 
obvious. And maybe it should just be relaxed to skip processing if the local 
thread state with a given index is not initialized.
   > 
   Yes, it is a copy from the GroupBy impl.
   Ah, good catch! That is something I didn't think about. Are we talking about a case like this?
   Ex: 4 threads, but only 1 input batch. So, before/while the other thread-local states are being initialized, thread 0 receives the batch and calls `BuildSideMerge()`. At that point the other states could still be null, and ideally we could `continue` the loop in that case (it is guaranteed that those states won't receive any more batches, because `build_counter_` is already completed).
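   Something like this relaxed loop, I guess (a simplified, self-contained sketch; `ThreadLocalState` and the member names are stand-ins, not the node's actual types):
   ```c++
   #include <memory>
   #include <vector>

   // Stand-in for the per-thread state; in the real node this holds a Grouper.
   struct ThreadLocalState {
     std::unique_ptr<int> grouper;  // null if this thread never saw a build batch
   };

   void BuildSideMerge(std::vector<ThreadLocalState>& local_states) {
     for (auto& state : local_states) {
       if (!state.grouper) {
         // This thread never consumed a build-side batch (e.g. 4 threads, 1 batch),
         // so there is nothing to merge from it; skip instead of DCHECK-ing.
         continue;
       }
       // ... merge this state's hash table into the shared one (omitted) ...
     }
   }
   ```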
   
   >     11. TotalReached() method added to AtomicCounter is not used anywhere.
   > 
   >     12. There is a problem with null key. I believe in hash join with 
equality condition it should be that "null != null" (and there is usually a 
separate comparison operator that treats nulls as equal), while in group by 
"null==null" when matching groups. We should have a comment about it and 
document it for the users (maybe we don't have documentation strings for exec 
nodes yet). If needed we would have to filter out null keys separately from 
Grouper.
   > 
   I see... but it looks like Pandas treats null/NaN/NA as a valid key; if users don't want that, they have to explicitly drop the NA values.
   
https://stackoverflow.com/questions/23940181/pandas-merging-with-missing-values
   I started a thread on this in Zulip 
https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Null.20values.20as.20keys
   
   
   Thank you very much, @michalursa!
   

