westonpace opened a new issue, #36523: URL: https://github.com/apache/arrow/issues/36523
### Describe the bug, including details regarding any error messages, version, and platform. I encountered this while stress testing #35440 with TSAN enabled but I believe it's not related to #35440 and, indeed, not a bug at all. I'm attaching the log from TSAN here but I'll give a summary. [tsan_out.txt](https://github.com/apache/arrow/files/11973450/tsan_out.txt) There are three locks involved and so it is quite confusing. M0: The holder lock (`std::mutex S3ClientHolder::mutex_`) M1: The finalizer lock (`std::shared_mutex S3ClientFinalizer::mutex_`) M2: The upload state lock (`std::mutex UploadState::mutex`) TSAN reports `Cycle in lock order graph: M0 => M1 => M2 => M0` The task is "creating a bunch of files in parallel". In the main thread we are calling create dir. This all happens in `arrow::fs::S3FileSystem::Impl::CreateBucket`. As part of this process we: * Obtain Exclusive Lock M0 * Obtain Shared Lock M1 In a CPU thread we are closing a file, which causes us to upload the final part. This all happens as part of `arrow::fs::(anonymous namespace)::ObjectOutputStream::UploadPart`. As part of this process we: * Obtain Shared Lock M1 * Obtain Exclusive Lock M2 In an I/O thread we are handling the upload outcome which causes us to call close on the file. This all happens as part of `arrow::fs::(anonymous namespace)::ObjectOutputStream::HandleUploadOutcome`. As part of this process we: * Obtain Exclusive Lock M2 * Obtain Exclusive Lock M0 However, given that M1 is a shared lock, I'm not sure this deadlock that TSAN is reporting is possible. Unfortunately, I think there is actually a potential deadlock here, just not the one that TSAN is reporting. As part of `HandleUploadOutcome` it seems we are first locking the upload state and then locking the client holder (`M2` then `M0`). As part of `UploadPart` we are first locking the client holder and then locking the upload state (`M0` then `M2`). TSAN is reporting this latter operation as `M1` then `M2` which is also true, but not as interesting. A good fix would probably be to release the upload state lock in `HandleUploadOutcome` before marking `pending_parts_completed` finished (in general it's not a good idea to mark futures finished when holding locks). I want to reproduce and confirm this is actually a problem and this fix actually works. ### Component(s) C++ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
