BewareMyPower opened a new pull request #11889:
URL: https://github.com/apache/pulsar/pull/11889


   Fixes #10721 
   
   ### Motivation
   
   Currently the connection timer stops after the TCP connection is established 
immediately (the state becomes `TcpConnected`). However, the connect phase 
should also include sending the `CommandConnect` request and receiving the 
`CommandConnected` response from broker successfully. For example, in a case 
like #10721 described, if the Pulsar broker received a SIGSTOP signal and 
became pending, the TCP connection can be established but no response could be 
received from a stopped broker. Then the client would hang forever.
   
   ### Modifications
   
   Stop the connection timer only after the `ClientConnection`'s state becomes 
`Ready`, which means the client has received the `CommandConnected` response 
successfully. It's also consistent with Java client's implementation, see 
https://github.com/apache/pulsar/blob/235e678a56d0284e68b45e46706b6237d7c6d5f9/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java#L329-L330
   
   ### Verifying this change
   
   - [ ] Make sure that the change passes the CI checks.
   
   It's hard to simulate the scenario in unit test. But we can simply reproduce 
it in local env.
   1. Run a pulsar standalone and send `SIGSTOP` signal by `pkill -SIGSTOP -f 
pulsar`.
   2. Run any C++/Python client with this fix to connect to the standalone.
   
   ```
   >>> import pulsar
   >>> c = pulsar.Client('pulsar://localhost:6650')
   >>> c.create_producer('xxx')
   2021-09-02 11:54:38.828 INFO  [0x10d72ae00] ClientConnection:181 | [<none> 
-> pulsar://localhost:6650] Create ClientConnection, timeout=10000
   2021-09-02 11:54:38.828 INFO  [0x10d72ae00] ConnectionPool:96 | Created 
connection for pulsar://localhost:6650
   2021-09-02 11:54:38.969 INFO  [0x700003136000] ClientConnection:367 | 
[127.0.0.1:62784 -> 127.0.0.1:6650] Connected to broker
   2021-09-02 11:54:48.973 ERROR [0x700003136000] ClientConnection:532 | 
[127.0.0.1:62784 -> 127.0.0.1:6650] Connection was not established in 10000 ms, 
close the socket
   2021-09-02 11:54:48.973 ERROR [0x700003136000] ClientConnection:572 | 
[127.0.0.1:62784 -> 127.0.0.1:6650] Read failed: Operation canceled
   2021-09-02 11:54:48.973 INFO  [0x700003136000] ClientConnection:1495 | 
[127.0.0.1:62784 -> 127.0.0.1:6650] Connection closed
   2021-09-02 11:54:48.973 ERROR [0x700003136000] ClientImpl:188 | Error 
Checking/Getting Partition Metadata while creating producer on 
persistent://public/default/xxx -- ConnectError
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File 
"/Users/xuyunze/github.com/BewareMyPower/pulsar/pulsar-client-cpp/python/pulsar/__init__.py",
 line 603, in create_producer
       p._producer = self._client.create_producer(topic, conf)
   _pulsar.ConnectError: Pulsar error: ConnectError
   ```
   
   we can see after 10 seconds (the default connect timeout), it failed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to