Tom-Newton commented on issue #15233:
URL: https://github.com/apache/arrow/issues/15233#issuecomment-2508410291

   I found a couple more related problems and I've ended up with 
https://github.com/Tom-Newton/arrow/pull/13. This includes 3 changes:
   
   1. Use `CloseAsync()` inside `copy_one_file` so that it returns futures, 
instead of blocking while it waits for other tasks to complete. This avoids the 
deadlock.
   2. Fix a segfault in `FileInterface::CloseAsync()` by using 
`shared_from_this()` to so that it doesn't get destructed prematurely. 
`shared_from_this()` is already used in the `CloseAsync()` implementation on 
several filesystems. After change 1 this is important when downloading to the 
local filesystem using `CopyFiles`.
   3. Spawn `copy_one_file` less urgently than default, so that 
`background_writes` are done with higher priority. Otherwise `copy_one_file` 
will keep buffering more data in memory without giving the `background_writes` 
any chance to upload the data and drop it from memory. Therefore, without this 
large copies would cause OOMs.
       1. The current Arrow thread pool implementation makes provision for 
spawning with a set priority but the priority numbers are just ignored. 
       1. I made a small change to use a `std::priority_queue` to implement the 
priority. 
       1. There is a property of the current implementation that tasks will be 
scheduled in the order of calls to `Spawn`. There are no tests to assert this 
behaviour, and I don't know if it is depended on by anything but currently my 
`std::priority_queue` version removes this property.
       1. I think this has got to be the most contentious part of the change, 
with the broadest possible impact if I introduce a bug. I would appreciate some 
input on whether this seems like a reasonable change. 
   
   I've been testing copying between Azure and local filesystems using a 
directory of about 50,000 files of varying sizes totalling 160GiB. With all the 
described changes both upload and download work reliably, with reasonable 
memory usage. I'm also relatively happy with the performance - it fluctuates 
more than I would like, but at times it saturates the network interface on my 
Azure VM.  
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to