jpatra72 opened a new issue, #50188:
URL: https://github.com/apache/arrow/issues/50188

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   In `pyarrow==24.0.0`, a process that creates a `pyarrow.fs.S3FileSystem` and 
keeps a reference to it until interpreter shutdown will deadlock during 
`Py_FinalizeEx`. The hang is in the `ensure_s3_finalized` atexit handler 
registered by `pyarrow/fs.py`, inside 
`Aws::Crt::Io::ClientBootstrap::~ClientBootstrap()`. The same code exits 
cleanly on `pyarrow==23.0.0`.
   
   This is the AWS-CRT "blocking-shutdown" deadlock that aws-sdk-cpp documents 
in [aws/aws-sdk-cpp#2769](https://github.com/aws/aws-sdk-cpp/issues/2769), but 
it is reachable from pure-Python pyarrow without any explicit S3 I/O, just by 
holding an `S3FileSystem` Python reference until interpreter shutdown. Arrow's 
atexit-driven finalize ordering plus the AWS C/C++ stack version bumps in 
24.0.0 turn this from a quiet teardown into an indefinite hang.
   
   #### Reproducer
   
   ```python
   # bug.py
   import pyarrow.fs
   s3 = pyarrow.fs.S3FileSystem()
   print("S3FileSystem created, exiting...")
   ```
   
   ```
   $ python bug.py
   S3FileSystem created, exiting...
   # process never exits; SIGTERM required
   ```
   
   No S3 request is issued — just constructing `S3FileSystem()` and holding the 
reference is enough.
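
Because the symptom is "the process never exits", it can help to run the reproducer under a deadline instead of watching a terminal. The sketch below is not part of the original report: `run_with_deadline` is a hypothetical helper name, and the 30 s default is an arbitrary assumption. It runs a snippet in a fresh interpreter and returns the exit code, or `None` if the child had to be killed.

```python
import subprocess
import sys
from typing import Optional


def run_with_deadline(source: str, timeout_s: float = 30.0) -> Optional[int]:
    """Run `source` in a fresh interpreter.

    Returns the child's exit code, or None if it did not exit within
    `timeout_s` seconds (hypothetical helper, not a pyarrow API).
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills and reaps the child before re-raising.
        return None
    return proc.returncode
```

Passing the body of `bug.py` as `source` would, per the report above, be expected to yield `None` on an affected build. A non-zero exit code (for example when pyarrow is not installed) is reported separately from a hang, so the two are not confused.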
   
   #### Diagnostic: `del s3` fixes it; `pyarrow.fs.finalize_s3()` does **not**
   
   ```python
   # fixed.py — clean exit
   import pyarrow.fs
   s3 = pyarrow.fs.S3FileSystem()
   print("S3FileSystem created, exiting...")
   del s3
   ```
   
   ```python
   # still_hangs.py — same deadlock, just earlier in the script
   import pyarrow.fs
   s3 = pyarrow.fs.S3FileSystem()
   pyarrow.fs.finalize_s3()   # hangs here instead of at atexit
   ```
   
   That asymmetry pins the cause to "live `S3Client` reference at the moment 
`Aws::ShutdownAPI` runs," not to a missed wakeup or a broken finalizer.
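
Since `del s3` is enough for a clean exit, one possible user-side stopgap is to drop the reference from an `atexit` hook. Python runs `atexit` handlers in reverse registration order, so a hook registered after `import pyarrow.fs` should run before `ensure_s3_finalized`. The sketch below is an assumption, not something verified against pyarrow: `release_at_exit` is a hypothetical helper, and it only helps if the named variable holds the last reference to the filesystem.

```python
import atexit
from typing import Callable, MutableMapping


def release_at_exit(namespace: MutableMapping, *names: str) -> Callable[[], None]:
    """Register an atexit hook that removes `names` from `namespace`.

    Hypothetical helper: the hook is returned so it can also be called early.
    """
    def _release() -> None:
        for name in names:
            # pop() with a default makes the hook safe to run more than once.
            namespace.pop(name, None)

    atexit.register(_release)
    return _release


# Intended use (assumes pyarrow is installed; untested against the bug):
#   import pyarrow.fs
#   s3 = pyarrow.fs.S3FileSystem()
#   release_at_exit(globals(), "s3")   # registered after pyarrow's handler
```

Any other live reference (a closure, a cache, another module's global) would still keep the client alive, so this is a mitigation for simple scripts rather than a fix.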
   
   #### gdb backtrace — main thread (the deadlock victim)
   
   ```
   Thread 1 (Thread 0x7ff7151f5740 (LWP 3685) "python"):
   #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
   #1  0x00007ff70fe02671 in std::__atomic_futex_unsigned_base::_M_futex_wait_until (...) from /lib/x86_64-linux-gnu/libstdc++.so.6
   #2  0x00007ff7114d8f01 in Aws::Crt::Io::ClientBootstrap::~ClientBootstrap() from pyarrow/libarrow.so.2400
   #3  0x00007ff711475b91 in Aws::SetDefaultClientBootstrap(...) from pyarrow/libarrow.so.2400
   #4  0x00007ff711475bc9 in Aws::CleanupCrt() from pyarrow/libarrow.so.2400
   #5  0x00007ff7114735a5 in Aws::ShutdownAPI(...) from pyarrow/libarrow.so.2400
   #6  0x00007ff710a04a28 in arrow::fs::EnsureS3Finalized() from pyarrow/libarrow.so.2400
   #7  0x00007ff6fe6693c0 in __pyx_pw_7pyarrow_5_s3fs_7ensure_s3_finalized(...) from pyarrow/_s3fs.cpython-312-x86_64-linux-gnu.so
   #8  0x000055bdcca34c2c in atexit_callfuncs    at Modules/atexitmodule.c:137
   #9  0x000055bdcca22213 in _PyAtExit_Call       at Modules/atexitmodule.c:157
   #10 Py_FinalizeEx ()                           at Python/pylifecycle.c:1927
   #11 0x000055bdcca30920 in Py_RunMain ()        at Modules/main.c:716
   #12 0x000055bdcc9ea477 in Py_BytesMain (...)   at Modules/main.c:768
   ```
   
   #### gdb backtrace — the AWS CRT event-loop thread
   
   ```
   Thread 66 (Thread 0x7ff661ff0640 (LWP 3751) "AwsEventLoop1"):
   #0  0x00007ff71531feae in epoll_wait (epfd=4, ..., timeout=100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
   #1  0x00007ff71153cf1a in aws_event_loop_thread () from pyarrow/libarrow.so.2400
   #2  0x00007ff7115fe639 in thread_fn () from pyarrow/libarrow.so.2400
   #3  0x00007ff71528eac3 in start_thread (...)       at pthread_create.c:442
   #4  0x00007ff71531fa84 in clone ()                 at clone.S:100
   ```
   
   The `epoll_wait(timeout=100000)` call (the timeout is in milliseconds, so 
100 s) is the standard idle poll interval of `aws-c-io`'s Linux epoll event loop 
([`DEFAULT_TIMEOUT = 100 * 1000`](https://github.com/awslabs/aws-c-io/blob/v0.26.3/source/linux/epoll_event_loop.c)). 
It is *not* the bug signal on its own. The real bug signal is that nothing 
wrote to the loop's wake pipe/eventfd to ask it to stop, because the 
underlying `aws_client_bootstrap`'s C-side refcount never reached zero.
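
To illustrate why a long idle poll is harmless by itself, here is a small stdlib-only model of an event loop that sleeps in `select` until a wake descriptor becomes readable. It is a sketch under assumptions: `IdleLoop` and its methods are hypothetical names, a socket pair stands in for the wake pipe/eventfd, and none of this is aws-c-io code.

```python
import selectors
import socket
import threading


class IdleLoop:
    """Toy event loop: idles in select() until the wake socket is written."""

    def __init__(self, idle_timeout_s: float = 100.0) -> None:
        self._idle_timeout_s = idle_timeout_s
        self._wake_r, self._wake_w = socket.socketpair()
        self._sel = selectors.DefaultSelector()
        self._sel.register(self._wake_r, selectors.EVENT_READ)
        self.idle_wakeups = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        while True:
            if self._sel.select(timeout=self._idle_timeout_s):
                return  # wake socket is readable: a stop was requested
            self.idle_wakeups += 1  # idle timeout elapsed, poll again

    def is_running(self) -> bool:
        return self._thread.is_alive()

    def stop(self, join_timeout_s: float = 5.0) -> bool:
        """Ask the loop to stop; True if the thread exited in time."""
        self._wake_w.send(b"x")
        self._thread.join(join_timeout_s)
        stopped = not self._thread.is_alive()
        if stopped:
            self._sel.close()
            self._wake_r.close()
            self._wake_w.close()
        return stopped
```

In this model a loop with a 100 s idle timeout still stops promptly once `stop()` writes to the wake socket. A loop that is never woken keeps polling forever, which mirrors the `AwsEventLoop1` thread in the backtrace.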
   
   #### Root cause analysis
   
   `~ClientBootstrap()` in `aws-crt-cpp` is:
   
   ```cpp
   aws_client_bootstrap_release(m_bootstrap);
   if (m_enableBlockingShutdown) {
       // If your program is stuck here, stop using EnableBlockingShutdown()
       m_shutdownFuture.wait();
   }
   ```
   ([aws-crt-cpp v0.38.0 
source/io/Bootstrap.cpp](https://github.com/awslabs/aws-crt-cpp/blob/v0.38.0/source/io/Bootstrap.cpp))
   
   `Aws::InitAPI()` (called by Arrow's `InitializeS3`) unconditionally calls 
`clientBootstrap->EnableBlockingShutdown()` ([aws-sdk-cpp 1.11.594 
Aws.cpp:90](https://github.com/aws/aws-sdk-cpp/blob/1.11.594/src/aws-cpp-sdk-core/source/Aws.cpp#L90)),
 so the destructor always takes the blocking path.
   
   The libstdc++ frame in our stack is `std::future::wait()` on 
`m_shutdownFuture` — a promise fulfilled by the C-layer's 
`on_shutdown_complete` callback, which only fires once the C 
`aws_client_bootstrap` reaches refcount zero.
   
   The Python reference `s3` keeps `pyarrow._s3fs.S3FileSystem` alive → keeps 
the C++ `shared_ptr<S3Client>` alive → keeps the underlying `aws_s3_client` 
alive → which holds a strong reference on `aws_client_bootstrap`. When 
`Aws::CleanupCrt()` drops Arrow's default `shared_ptr<ClientBootstrap>`, the 
C++ wrapper destructor releases only *its own* single C-side reference. The C 
bootstrap therefore still has refcount > 0, `on_shutdown_complete` never fires, 
and the main thread futex-waits indefinitely. Meanwhile the event-loop thread 
sits at the top of its 100 s idle poll, never told to stop.
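
The reference chain described above can be modelled in a few lines of plain Python. This is only a sketch: `CBootstrap`, `Client` and `destroy_wrapper` are hypothetical stand-ins for the C bootstrap, an S3 client and `~ClientBootstrap()`, the completion callback is fired synchronously for simplicity, and the wait uses a short timeout so the example terminates, whereas the real destructor waits without one.

```python
import threading
from typing import Optional


class CBootstrap:
    """Stand-in for the refcounted C `aws_client_bootstrap`."""

    def __init__(self) -> None:
        self._refs = 1  # the reference owned by the C++ wrapper
        self._lock = threading.Lock()
        self.shutdown_complete = threading.Event()

    def acquire(self) -> None:
        with self._lock:
            self._refs += 1

    def release(self) -> None:
        with self._lock:
            self._refs -= 1
            last = self._refs == 0
        if last:
            # Analogue of on_shutdown_complete fulfilling the promise.
            self.shutdown_complete.set()


class Client:
    """Stand-in for a client that holds a strong bootstrap reference."""

    def __init__(self, bootstrap: CBootstrap) -> None:
        self._bootstrap: Optional[CBootstrap] = bootstrap
        bootstrap.acquire()

    def close(self) -> None:
        if self._bootstrap is not None:
            self._bootstrap.release()
            self._bootstrap = None


def destroy_wrapper(bootstrap: CBootstrap, wait_s: float = 0.2) -> bool:
    """Model of the blocking destructor: release one ref, then wait.

    Returns True if shutdown completed within `wait_s` seconds.
    """
    bootstrap.release()
    return bootstrap.shutdown_complete.wait(wait_s)
```

With a live `Client`, `destroy_wrapper` times out, which corresponds to the hang in `fs.finalize_s3()` and at atexit. Closing the client first lets the same call return immediately, which corresponds to the `del s3` case.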
   
   #### Why 23.0.0 worked and 24.0.0 doesn't
   
   Arrow's `s3fs.cc` finalize code is essentially identical between 23.0.0 and 
24.0.0; the change is in the bundled AWS C/C++ stack 
([`cpp/thirdparty/versions.txt`](https://github.com/apache/arrow/blob/apache-arrow-24.0.0/cpp/thirdparty/versions.txt)):
   
   | component | 23.0.0 | 24.0.0 |
   |---|---|---|
   | aws-crt-cpp | 0.32.8 | **0.38.0** |
   | aws-c-io    | 0.19.1 | **0.26.3** |
   | aws-c-s3    | 0.8.1  | **0.12.0** |
   
   The `EnableBlockingShutdown` + `m_shutdownFuture.wait()` pattern itself has 
not changed across these versions, but the ref-graph / teardown-task scheduling 
in `aws-c-io` between 0.19 and 0.26 changed enough that a shutdown which was 
already incorrect, but previously completed quickly, now reliably deadlocks 
whenever any `S3Client` reference is alive.
   
   #### Suggested fix directions
   
   1. **Arrow side (preferred):** at the top of `arrow::fs::FinalizeS3` (before 
`Aws::ShutdownAPI`), drop Arrow's internal `S3Client` cache 
(`S3ClientFinalizer`) and ensure no `S3FileSystem`-owned `shared_ptr<S3Client>` 
survives. Today the finalizer does call into `S3ClientFinalizer::Finalize()`, 
but any `S3FileSystem` instance still reachable from Python keeps 
`S3ClientHolder` alive past that point. Either:
      - hold `S3Client` via `weak_ptr` inside `S3FileSystem` and re-resolve 
through the finalizer, or
      - have the atexit handler eagerly walk known `S3FileSystem` instances 
(via a `weak_ptr` registry) and reset their clients before calling 
`ShutdownAPI`.
   2. **Bypass the blocking destructor:** consider whether Arrow needs 
`Aws::InitAPI` defaults at all — building `SDKOptions` with a custom 
`ClientBootstrap` that doesn't call `EnableBlockingShutdown()` would convert 
this from a hang into a benign leak (matching the warning in aws-crt-cpp's own 
source).
   3. **At minimum, document the footgun.** Today there is no docstring warning 
that holding a `pyarrow.fs.S3FileSystem` past interpreter shutdown will hang 
the process on `pyarrow>=24`.
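
The `weak_ptr` registry idea from option 1 can be sketched in Python with `weakref.WeakSet`. Everything here is illustrative: `FakeClient`, `FakeFileSystem` and `finalize` are hypothetical names, not Arrow classes, and the real change would live in C++ inside `s3fs.cc`.

```python
import weakref
from typing import Callable, Optional


class FakeClient:
    """Stand-in for the S3 client owned by a filesystem."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class FakeFileSystem:
    """Stand-in filesystem that registers itself in a weak registry."""

    # Weak references only: the registry never keeps a filesystem alive.
    _live: "weakref.WeakSet[FakeFileSystem]" = weakref.WeakSet()

    def __init__(self) -> None:
        self.client: Optional[FakeClient] = FakeClient()
        FakeFileSystem._live.add(self)

    def reset_client(self) -> None:
        if self.client is not None:
            self.client.close()
            self.client = None


def finalize(shutdown_api: Callable[[], None]) -> None:
    """Reset every still-reachable filesystem's client, then shut down."""
    # Copy to a list so the set cannot change size while iterating.
    for fs in list(FakeFileSystem._live):
        fs.reset_client()
    shutdown_api()
```

The design point is the ordering: clients are released before the shutdown call runs, regardless of whether user code still holds the filesystem object. A filesystem used after finalization would need to fail with a clear error rather than dereference a missing client.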
   
   #### Related
   
   - [aws/aws-sdk-cpp#2769 — `ShutdownAPI` hangs when client outlives 
`ShutdownAPI`](https://github.com/aws/aws-sdk-cpp/issues/2769) (closed; 
identical stack signature, root cause documented)
   - [aws-crt-cpp v0.38.0 `~ClientBootstrap` source 
comment](https://github.com/awslabs/aws-crt-cpp/blob/v0.38.0/source/io/Bootstrap.cpp)
 — "If your program is stuck here, stop using EnableBlockingShutdown()"
   - [Arrow PR #38375 (GH-38364)](https://github.com/apache/arrow/pull/38375) — 
atexit registration for `ensure_s3_finalized` (the path that fires the deadlock)
   
   ### Component(s)
   
   Python, C++
   
   ---
   
   *Note: I captured the gdb dump and verified the reproducer (including the 
`del s3` vs `finalize_s3()` asymmetry) on a real environment; the write-up and 
cross-references to aws-sdk-cpp / aws-crt-cpp source were drafted with help 
from Claude. Happy to provide the full gdb log, a Dockerfile repro, or 
additional traces on request.*
   

