Tom-Newton commented on code in PR #44897:
URL: https://github.com/apache/arrow/pull/44897#discussion_r1869468022
##########
cpp/src/arrow/filesystem/filesystem.cc:
##########
@@ -642,12 +645,31 @@ Status CopyFiles(const std::vector<FileLocator>& sources,
ARROW_ASSIGN_OR_RAISE(auto destination,
destinations[i].filesystem->OpenOutputStream(
destinations[i].path,
metadata));
RETURN_NOT_OK(internal::CopyStream(source, destination, chunk_size,
io_context));
- return destination->Close();
+ // Using the blocking Close() here can cause reduced performance and
deadlocks because
+ // FileSystem implementations that implement background_writes need to
queue and wait
+ // for other IO thread(s). There is a risk that most or all the threads in
the IO
+ // thread pool are blocking on a call Close(), leaving no IO threads left
to actually
+ // fulfil the background writes.
+ return destination->CloseAsync();
};
- return ::arrow::internal::OptionalParallelFor(
- use_threads, static_cast<int>(sources.size()), std::move(copy_one_file),
- io_context.executor());
+ // Spawn copy_one_file less urgently than default, so that background_writes
are done
+ // with higher priority. Otherwise copy_one_file will keep buffering more
data in memory
+ // without giving the background_writes any chance to upload the data and
drop it from
+ // memory. Therefore, without this large copies would cause OOMs.
Review Comment:
Yes, that is why I updated `cpp/src/arrow/util/thread_pool.cc` to make use
of `TaskHints`... I've been trying to get some opinion from maintainers on
whether this is a reasonable thing to do.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]