westonpace opened a new issue, #36523:
URL: https://github.com/apache/arrow/issues/36523

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   I encountered this while stress testing #35440 with TSAN enabled but I 
believe it's not related to #35440 and, indeed, not a bug at all.  I'm 
attaching the log from TSAN here but I'll give a summary.
   [tsan_out.txt](https://github.com/apache/arrow/files/11973450/tsan_out.txt)
   
   There are three locks involved and so it is quite confusing.
   
   M0: The holder lock (`std::mutex S3ClientHolder::mutex_`)
   M1: The finalizer lock (`std::shared_mutex S3ClientFinalizer::mutex_`)
   M2: The upload state lock (`std::mutex UploadState::mutex`)
   
   TSAN reports `Cycle in lock order graph: M0 => M1 => M2 => M0`
   
   The task is "creating a bunch of files in parallel".
   
   In the main thread we are calling create dir.  This all happens in 
`arrow::fs::S3FileSystem::Impl::CreateBucket`.  As part of this process we:
   
    * Obtain Exclusive Lock M0
    * Obtain Shared Lock M1
   
   In a CPU thread we are closing a file, which causes us to upload the final 
part.  This all happens as part of `arrow::fs::(anonymous 
namespace)::ObjectOutputStream::UploadPart`.  As part of this process we:
   
    * Obtain Shared Lock M1
    * Obtain Exclusive Lock M2
   
   In an I/O thread we are handling the upload outcome which causes us to call 
close on the file.  This all happens as part of `arrow::fs::(anonymous 
namespace)::ObjectOutputStream::HandleUploadOutcome`.  As part of this process 
we:
   
    * Obtain Exclusive Lock M2
    * Obtain Exclusive Lock M0
   
   However, given that M1 is a shared lock, I'm not sure this deadlock that 
TSAN is reporting is possible.
   
   Unfortunately, I think there is actually a potential deadlock here, just not 
the one that TSAN is reporting.  As part of `HandleUploadOutcome` it seems we 
are first locking the upload state and then locking the client holder (`M2` 
then `M0`).  As part of `UploadPart` we are first locking the client holder and 
then locking the upload state (`M0` then `M2`).  TSAN is reporting this latter 
operation as `M1` then `M2` which is also true, but not as interesting.
   
   A good fix would probably be to release the upload state lock in 
`HandleUploadOutcome` before marking `pending_parts_completed` finished (in 
general it's not a good idea to mark futures finished when holding locks).  I 
want to reproduce and confirm this is actually a problem and this fix actually 
works.
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to