John Watson created FLINK-38904:
-----------------------------------

             Summary: MySQL CDC binlog reader hangs due to TLS 1.3 KeyUpdate 
deadlock (potentially JDK-8241239)
                 Key: FLINK-38904
                 URL: https://issues.apache.org/jira/browse/FLINK-38904
             Project: Flink
          Issue Type: Bug
          Components: Flink CDC
         Environment: * JDK 11.0.18 (TLS 1.3 enabled by default)
 * MySQL on AWS RDS with SSL required
 * ~137GB data processed
            Reporter: John Watson


The MySQL CDC binlog reader hangs after processing ~137GB of data over a 
TLS 1.3 connection. The likely cause is JDK bug 
[JDK-8241239|https://bugs.openjdk.org/browse/JDK-8241239], in which handling a 
TLS 1.3 KeyUpdate triggers a deadlock in SSLSocketImpl.
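
The ~137GB figure lines up with 2^37 bytes, which is the default AES/GCM entry in the JDK's {{jdk.tls.keyLimits}} security property. The sketch below only does that arithmetic and prints the configured value; the class and method names are illustrative, and it assumes the default limit has not been overridden in {{java.security}}.

```java
import java.security.Security;

class KeyLimitEstimate {
    // Assumed default: "AES/GCM/NoPadding KeyUpdate 2^37" in java.security.
    static long defaultKeyLimitBytes() {
        return 1L << 37;
    }

    // Decimal gigabytes (10^9 bytes), matching the "~137GB" used in this report.
    static double toDecimalGigabytes(long bytes) {
        return bytes / 1e9;
    }

    public static void main(String[] args) {
        long limit = defaultKeyLimitBytes();
        System.out.printf("limit = %d bytes (~%.1f GB)%n", limit, toDecimalGigabytes(limit));
        // Shows what this JVM is actually configured with (null if the property is unset).
        System.out.println("jdk.tls.keyLimits = " + Security.getProperty("jdk.tls.keyLimits"));
    }
}
```

If the property printed on the affected job differs from the default, the volume at which the hang appears would be expected to differ as well.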

 

A TLS 1.3 KeyUpdate is triggered after ~137GB of data transfer (the AES-GCM 
key usage limit). The deadlock occurs as follows:
 * Reader thread receives KeyUpdate, must respond by writing new keys
 * Reader thread holds SSL lock, blocks in native {{socketWrite0()}}
 * Keepalive thread detects timeout, attempts to close connection
 * {{SSLSocketImpl.closeNotify()}} requires the same SSL lock
 * Deadlock: Reader holds the lock while waiting on network I/O; Keepalive 
waits for the same lock
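
The lock pattern in the steps above can be sketched without any TLS at all. This is a simplified model, not the JDK code: a {{ReentrantLock}} stands in for the lock inside SSLSocketImpl, a latch stands in for the stalled native write, and the calling thread plays the keepalive role. All names are hypothetical. The real keepalive path calls {{lock()}} and waits indefinitely; {{tryLock}} with a timeout is used here only so the demo terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class KeyUpdateHangSketch {
    /** Returns true if the "keepalive" caller could take the lock within the timeout. */
    static boolean keepaliveCanAcquire(long timeoutMillis) {
        ReentrantLock sslLock = new ReentrantLock();            // stand-in for the SSL lock
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch writeUnblocked = new CountDownLatch(1);  // stand-in for the stalled write

        Thread reader = new Thread(() -> {
            sslLock.lock();
            try {
                lockHeld.countDown();
                writeUnblocked.await();  // simulates socketWrite0() that does not return
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                sslLock.unlock();
            }
        }, "reader");
        reader.setDaemon(true);
        reader.start();

        try {
            lockHeld.await();  // make sure the reader owns the lock first
            // Keepalive role: tries to take the same lock in order to close the connection.
            boolean acquired = sslLock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
            if (acquired) {
                sslLock.unlock();
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            writeUnblocked.countDown();  // release the reader so the demo can exit
        }
    }

    public static void main(String[] args) {
        System.out.println("keepalive acquired lock: " + keepaliveCanAcquire(200));
    }
}
```

In this model the keepalive side never gets the lock while the simulated write is stalled, which mirrors the WAITING state of the keepalive thread in the dump.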

Thread Dump:
{code:java}
  Thread: blc-...:3306 (id=113)
    State: RUNNABLE (blocked in native socketWrite0)
    Holds: ReentrantLock@753cff5d
    Stack:
      java.net.SocketOutputStream.socketWrite0(Native Method)
      sun.security.ssl.SSLSocketOutputRecord.flush()
      sun.security.ssl.OutputRecord.changeWriteCiphers()
      sun.security.ssl.KeyUpdate$KeyUpdateProducer.produce()
      sun.security.ssl.SSLSocketImpl.tryKeyUpdate()
      sun.security.ssl.SSLSocketImpl.decode()
      sun.security.ssl.SSLSocketImpl.readApplicationRecord()
      ...
  Thread: blc-keepalive-...:3306 (id=115)
    State: WAITING
    Waiting on: ReentrantLock@753cff5d                          
    Lock owner: Thread 113                                      
    Stack:
      java.util.concurrent.locks.ReentrantLock.lock()
      sun.security.ssl.SSLSocketImpl.closeNotify()              
      sun.security.ssl.TransportContext.closeNotify()
      sun.security.ssl.SSLSocketImpl.shutdownOutput()
      com.github.shyiko.mysql.binlog.network.protocol.PacketChannel.close()
      com.github.shyiko.mysql.binlog.BinaryLogClient.disconnectChannel()
      com.github.shyiko.mysql.binlog.BinaryLogClient.terminateConnect()
      ...
{code}
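
For anyone trying to confirm the same state in their own job, the lock owner information shown above can also be collected in-process through the standard {{java.lang.management}} API. The following is a minimal sketch with hypothetical class and method names; it lists every thread that is waiting on a lock together with the thread that owns it. Because the owning thread here is blocked in native I/O rather than waiting on another lock, a cycle-based deadlock check would not be expected to flag this situation, so listing owners directly is more useful.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

class LockWaitReport {
    /** One line per thread that is waiting on a lock owned by another thread. */
    static List<String> describeLockWaiters() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> lines = new ArrayList<>();
        // Second argument requests ownable synchronizers such as ReentrantLock.
        for (ThreadInfo info : mx.dumpAllThreads(false, true)) {
            if (info.getLockName() != null && info.getLockOwnerId() != -1) {
                lines.add(String.format(
                        "%s (id=%d) %s on %s, owner: %s (id=%d)",
                        info.getThreadName(), info.getThreadId(), info.getThreadState(),
                        info.getLockName(), info.getLockOwnerName(), info.getLockOwnerId()));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        describeLockWaiters().forEach(System.out::println);
    }
}
```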
 

+Steps to Reproduce+
 # Configure MySQL CDC with SSL enabled ({{requireSSL=true}}) against AWS Aurora
 # Use JDK 11 (TLS 1.3 enabled by default)
 # Process a high-volume CDC workload (>137GB)
 # Observe the binlog reader thread deadlock

+Expected Behavior+
Binlog reader continues processing indefinitely without deadlocking.

+Actual Behavior+
Binlog reader deadlocks after ~137GB of data transfer when TLS 1.3 KeyUpdate is 
triggered. The reader thread holds the SSL lock while blocked in 
{{socketWrite0()}}, and the keepalive thread blocks forever waiting for the 
same lock to send {{close_notify}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
