zbentley commented on issue #127: URL: https://github.com/apache/pulsar-client-python/issues/127#issuecomment-1572093495
Thanks @BewareMyPower. Here are the logs from a run (macOS 11, Python 3.10.9, client 3.1.0) that got stuck with 4 processes:

```
Joining pool
Joined pool
2023-06-01 09:40:21.098 INFO [0x104424580] ProducerImpl:697 | Producer - [persistent://chariot1/chariot_ns_sre--kms_test/chariot_topic_kms_test-partition-1, standalone-36-796] , [batching = off]
2023-06-01 09:40:21.098 INFO [0x104424580] ClientConnection:1600 | [[::1]:53830 -> [::1]:6650] Connection closed with ConnectError
2023-06-01 09:40:21.099 INFO [0x104424580] ClientConnection:269 | [[::1]:53830 -> [::1]:6650] Destroyed connection
2023-06-01 09:40:21.357 INFO [0x104514580] ClientConnection:190 | [<none> -> pulsar://localhost:6650] Create ClientConnection, timeout=10000
2023-06-01 09:40:21.357 INFO [0x104514580] ConnectionPool:97 | Created connection for pulsar://localhost:6650
2023-06-01 09:40:21.359 INFO [0x16baf3000] ClientConnection:388 | [[::1]:53831 -> [::1]:6650] Connected to broker
2023-06-01 09:40:21.375 INFO [0x16baf3000] HandlerBase:72 | [persistent://chariot1/chariot_ns_sre--kms_test/chariot_topic_kms_test-partition-1, ] Getting connection from pool
2023-06-01 09:40:21.386 INFO [0x16baf3000] ProducerImpl:202 | [persistent://chariot1/chariot_ns_sre--kms_test/chariot_topic_kms_test-partition-1, ] Created producer on broker [[::1]:53831 -> [::1]:6650]
Destroying connections
2023-06-01 09:40:21.402 INFO [0x104514580] ProducerImpl:697 | Producer - [persistent://chariot1/chariot_ns_sre--kms_test/chariot_topic_kms_test-partition-1, standalone-36-797] , [batching = off]
2023-06-01 09:40:21.402 INFO [0x104514580] ClientConnection:1600 | [[::1]:53831 -> [::1]:6650] Connection closed with ConnectError
Destroying connections
2023-06-01 09:40:21.403 INFO [0x104514580] ProducerImpl:697 | Producer - [persistent://chariot1/chariot_ns_sre--kms_test/chariot_topic_kms_test-partition-1, standalone-36-797] , [batching = off]
Destroying connections
2023-06-01 09:40:21.403 INFO [0x104514580] ClientConnection:1600 | [[::1]:53831 -> [::1]:6650] Connection closed with ConnectError
2023-06-01 09:40:21.403 INFO [0x104514580] ProducerImpl:697 | Producer - [persistent://chariot1/chariot_ns_sre--kms_test/chariot_topic_kms_test-partition-1, standalone-36-797] , [batching = off]
2023-06-01 09:40:21.403 INFO [0x104514580] ClientConnection:1600 | [[::1]:53831 -> [::1]:6650] Connection closed with ConnectError
Destroying connections
Destroyed connections
Destroyed connections
Destroyed connections
Joining pool
```

An `lldb` backtrace is attached to this comment. It looks slightly different from the `py-spy` backtrace I provided from a Linux host in production, but shows similar defective behavior.

The problem largely appears to be that *fork-safe programs that use threads must assume those threads may vanish without informing the rest of the program* (that's what `pthread_atfork(3)` is for). When a threaded program forks without exec'ing, only the thread that called `fork(2)` exists in the child. All of the other threads vanish, in the midst of whatever they were doing.

To be fair, this is [documentedly unsafe behavior according to POSIX](http://www.doublersolutions.com/docs/dce/osfdocs/htmls/develop/appdev/Appde193.htm), but it's also the everyday reality for the most common Python application harnesses in the world. Most Python isn't single-threaded, nor is it single-process. As a result, drivers loaded by Python programs must assume those programs may fork, at which point threads will vanish.

While this is water far under the bridge at this point for `pulsar-client`, those realities are among the reasons why multithreaded drivers are often a problematic design: drivers have to work in "hostile environments" (embedded interpreters, forking code, thread-constrained or under-resourced hosts, driver code invoked from signal handlers or atfork hooks, etc.).

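
For illustration, the thread loss on fork can be reproduced without `pulsar-client` at all. The following is a minimal sketch, assuming a POSIX host where `os.fork` is available; `count_threads_across_fork` is a hypothetical helper written for this comment, not part of any library. It starts one background thread, forks, and reports how many Python threads each side of the fork can see.

```python
import os
import threading


def count_threads_across_fork():
    """Fork while a background thread is running.

    Returns (threads seen in the parent, threads seen in the child).
    Hypothetical helper for illustration only.
    """
    stop = threading.Event()
    # Stand-in for a driver's internal I/O thread: it just blocks until told to stop.
    worker = threading.Thread(target=stop.wait, daemon=True)
    worker.start()
    parent_count = threading.active_count()

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: only the thread that called fork() was copied.
        try:
            os.close(read_fd)
            os.write(write_fd, str(threading.active_count()).encode())
            os.close(write_fd)
        finally:
            # Exit immediately so the child never runs the parent's cleanup code.
            os._exit(0)

    # Parent process: read the child's report, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        child_count = int(reader.read())
    os.waitpid(pid, 0)

    stop.set()
    worker.join()
    return parent_count, child_count


if __name__ == "__main__":
    parent, child = count_threads_across_fork()
    print(f"threads in parent: {parent}, threads in child: {child}")
```

The worker thread here is idle, so nothing breaks. In a real driver, the vanished thread might have been holding a lock or servicing a socket at the moment of the fork, and the child has no way to learn that.
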
Using multithreading inside a client library might be safe in languages that tend to work in a more uniform "the runtime is the entry point" way, like Go and Java, but in languages like Python that are often run in weird ways and/or messed up environments, it can cause problems. The more robust drivers I've used eschew multithreading internally, even at the cost of more complex usage APIs for end users (e.g. user code must assume the responsibility for "turning" the driver event loop and/or performing heartbeat pings).
