Tom-Newton commented on issue #15233: URL: https://github.com/apache/arrow/issues/15233#issuecomment-2508410291
I found a couple more related problems and I've ended up with https://github.com/Tom-Newton/arrow/pull/13. This includes 3 changes: 1. Use `CloseAsync()` inside `copy_one_file` so that it returns futures, instead of blocking while it waits for other tasks to complete. This avoids the deadlock. 2. Fix a segfault in `FileInterface::CloseAsync()` by using `shared_from_this()` to so that it doesn't get destructed prematurely. `shared_from_this()` is already used in the `CloseAsync()` implementation on several filesystems. After change 1 this is important when downloading to the local filesystem using `CopyFiles`. 3. Spawn `copy_one_file` less urgently than default, so that `background_writes` are done with higher priority. Otherwise `copy_one_file` will keep buffering more data in memory without giving the `background_writes` any chance to upload the data and drop it from memory. Therefore, without this large copies would cause OOMs. 1. The current Arrow thread pool implementation makes provision for spawning with a set priority but the priority numbers are just ignored. 1. I made a small change to use a `std::priority_queue` to implement the priority. 1. There is a property of the current implementation that tasks will be scheduled in the order of calls to `Spawn`. There are no tests to assert this behaviour, and I don't know if it is depended on by anything but currently my `std::priority_queue` version removes this property. 1. I think this has got to be the most contentious part of the change, with the broadest possible impact if I introduce a bug. I would appreciate some input on whether this seems like a reasonable change. I've been testing copying between Azure and local filesystems using a directory of about 50,000 files of varying sizes totalling 160GiB. With all the described changes both upload and download work reliably, with reasonable memory usage. I'm also relatively happy with the performance - it fluctuates more than I would like, but at times it saturates the network interface on my Azure VM. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
