Copilot commented on code in PR #12231:
URL: https://github.com/apache/gluten/pull/12231#discussion_r3356291571
##########
cpp/velox/shuffle/VeloxHashShuffleWriter.cc:
##########
@@ -1514,22 +1513,33 @@ arrow::Result<int64_t>
VeloxHashShuffleWriter::evictPartitionBuffersMinSize(int6
pidToSize.emplace_back(pid, partitionBufferSize_[pid]);
}
std::sort(pidToSize.begin(), pidToSize.end(), [&](const auto& a, const auto&
b) { return a.second > b.second; });
- if (!pidToSize.empty()) {
- for (auto& item : pidToSize) {
- auto pid = item.first;
- ARROW_ASSIGN_OR_RAISE(auto buffers, assembleBuffers(pid, false));
- auto* types = partitionWriter_->enableTypeAwareCompress() ?
&tacBufferTypes_ : nullptr;
- auto payload = std::make_unique<InMemoryPayload>(
- item.second, &isValidityBuffer_, schema_, std::move(buffers),
hasComplexType_, types);
- metrics_.totalBytesToEvict += payload->rawSize();
- RETURN_NOT_OK(partitionWriter_->hashEvict(pid, std::move(payload),
Evict::kSpill, false, writtenBytes_));
- evicted = beforeEvict - partitionBufferPool_->bytes_allocated();
- if (evicted >= size) {
- break;
- }
+
+ struct PartitionPayload {
+ uint32_t pid;
+ std::unique_ptr<InMemoryPayload> payload;
+ };
+ std::vector<PartitionPayload> selectedPayloads;
+ int64_t selectedBytes = 0;
+ for (auto& item : pidToSize) {
+ auto pid = item.first;
+ ARROW_ASSIGN_OR_RAISE(auto buffers, assembleBuffers(pid, false));
Review Comment:
`evictPartitionBuffersMinSize()` now assembles buffers and materializes
`InMemoryPayload`s for all selected partitions before calling `hashEvict()`.
Since `assembleBuffers()` can allocate extra memory (e.g., complex-type flush
buffer allocation/reserve) and none of the selected partition buffers are
released until the second loop, this can increase peak memory usage and raise
OOM risk compared to the previous per-partition spill loop (which could free
memory incrementally and stop early). Consider selecting only partition IDs in
the first pass, then sorting by pid and doing `assembleBuffers()+hashEvict()`
in pid order so memory can be released as each partition is spilled.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]