zanmato1984 commented on code in PR #45789:
URL: https://github.com/apache/arrow/pull/45789#discussion_r2003072183
##########
cpp/src/arrow/acero/groupby_aggregate_node.cc:
##########
@@ -312,7 +312,7 @@ Result<ExecBatch> GroupByNode::Finalize() {
segment_key_field_ids_.size());
// Segment keys come first
- PlaceFields(out_data, 0, segmenter_values_);
+ PlaceFields(out_data, 0, state->segmenter_values);
Review Comment:
See [1], the `Finalize` is following a `Merge` call, by when the states of
all threads have been merged into `state[0]`.
But this raises an interesting question that I only realized now: the
proposed thread-local `segmenter_values`s are never merged, how dare I only
append `segmenter_values` in `state[0]`? This seems wrong but turns out to be
OK because of some implicit assumptions quite far away:
1) If the `segmenter_values_` is not empty, then at least one segment key is
specified, then we are executing single-threaded and only "state[0]" exists.
This is detailed in my answer to the other comment.
2) Else, then there are no segment keys and we are appending nothing.
Even though the fix actually works, it "feels" more weird than before - a
shared `segmenter_values_` seems to be, at least conceptually, more reasonable
(though it has race and isn't really correct in the context of
multi-threading). This makes me hesitate about the current fix. I think it all
boils down to the "abstraction leak" in the original design:
1. The segmenter abstraction seems to be designed to be independent of the
assumption that "no multi-threading for segmented cases" (and honestly I am
personally quite fond of this design).
2. The existence of `segmenter_values_` strongly depends on the assumption
of single-threaded.
Now I will try to fix the race in another way more independent of this
abstraction leak, leaving the later to be addressed in the future.
[1]
https://github.com/apache/arrow/blob/fc0862a25fa2d6aa17df14daa9ad37430d432425/cpp/src/arrow/acero/groupby_aggregate_node.cc#L353-L354
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]