nirandaperera edited a comment on pull request #10845:
URL: https://github.com/apache/arrow/pull/10845#issuecomment-891068825
> 2. It seems to me that ConsumeCachedProbeBatches is only called for a
single thread index - the one for the thread that reaches completion of
build_counter_.
>
Yes, thanks @michalursa. I missed this!
> 3. Because in StopProducing calls to Cancel() on two AtomicCounters
are connected with ||, finished_.MarkFinished() can be called twice (first
thread gets true from first counter Cancel() call and some other thread from
the second Cancel() call). Also, shouldn't we always Cancel() both counters?
>
I see... The `GroupByNode` had this,
```c++
if (input_counter_.Cancel()) {
finished_.MarkFinished();
} else if (output_counter_.Cancel()) {
finished_.MarkFinished();
}
```
and I was wondering why both the cases had the same code path. I thought it
can be combined in a single statement.
So, do you mean to say that `finished_.MarkFinished()` should be called if
`build_counter_.Cancel() && out_counter_.Cancel()`?
> 4. I wonder what happens with an empty input on build side?
>
This was my thought process. `build_counter_` has -1 for total initially.
So, until the `build_input` signals the `InputFinished` with 0, probe batches
will be cached. And when it receives 0, it toggles `build_side_complete_` and
probe batches will be queried against an empty hashmap.
We could actually return a NullArray in the `Grouper::Find` method,
prematurely (if the hashmap is empty). WDYT?
> 5. I think we will only have one class for hash join. It should be
fine to call it HashJoinNode and throw Status::NotImplemented() for join types
outside of semi, anti-semi. JoinType enum could have elements from all other
types as well. Also JoinType enum would be better as "enum class" although
Arrow C++ probably has some policy about enums.
>
Sure!
> 6. I would rename build_side_complete_ to hash_table_built_ or
hash_table_build_complete_. Currently I get it confused with build_counter_
checks, where one means all build side input batches consumed by local state
hash tables, and the other means hash table merge is complete.
>
Sure!
> 7. Also it would be nice to tie these two conditions above to futures,
so that a merge task and tasks to process cache probe side batches could be
generated and scheduled to execute once these futures are complete. But the
futures are not critical at this point, just something nice to have.
>
I will think about this one! :-)
> 8. Status returned from CacheProbeBatch is always OK()
>
I'll make this void!
> 9. We probably don't support DictionaryArray in key columns in the
code as it is right now, we should check and return Status::NotImplemented()
when making hash join node (or make sure it works). Also there could be a
scenario where one side of the join uses DictionaryArray while the other uses
Array with the same underlying type for keys to compare.
>
Sure!
> 10. In BuildSideMerge() ARROW_DCHECK(state->grouper). Perhaps it is a
copy-paste from group by node, but it would be good to have a comment about why
it is not possible to have states 0 and 2 initialized but not 1. This is not
obvious. And maybe it should just be relaxed to skip processing if the local
thread state with a given index is not initialized.
>
Yes, it is a copy from the GroupBy impl.
Ah! Good catch! that is something I didnt think about! Are we talking about
a case like this?
Ex: 4 threads, but only 1 input batch. So, before/while other batches being
initialized, thread0 receives the batch and calls `BuildSideMerge()`. Now,
other states could have null, and ideally we could `continue` the loop if that
is the case (because it is guaranteed that those states wouldn't receive any
more batches, because build_counter_ is already completed.)
> 11. TotalReached() method added to AtomicCounter is not used anywhere.
>
> 12. There is a problem with null key. I believe in hash join with
equality condition it should be that "null != null" (and there is usually a
separate comparison operator that treats nulls as equal), while in group by
"null==null" when matching groups. We should have a comment about it and
document it for the users (maybe we don't have documentation strings for exec
nodes yet). If needed we would have to filter out null keys separately from
Grouper.
>
I see... but it looks like Pandas holds null/NaN/na as a valid key and if
the users want to, they have to explicitly drop na values.
https://stackoverflow.com/questions/23940181/pandas-merging-with-missing-values
I started a thread on this in Zulip
https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Null.20values.20as.20keys
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]