[DISCUSS] Brokers disconnect intermittently with TLS1.3

2021-11-16 Thread Kokoori, Shylaja
Hi Luke,



Sorry about the miscommunication; I was not talking about making TLS 1.2 the
default. My assumption is that if the JDK version is < 11, TLS 1.2 will be used,
so I wanted to come up with a solution that works for both cases.



To provide more details about the issue, the error reported in kafka.log is
given below. It is easily reproducible with a record size of 500KB.

The scenario in which this happens:

in the read function (in SslTransportLayer.java),
unwrapResult.getHandshakeStatus() == NEED_WRAP and unwrapResult.getStatus() == OK,

which causes the renegotiation exception to be thrown.
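
The condition described above can be sketched roughly as follows. This is a minimal illustration, not the Kafka source: the class and method names (RenegotiationCheckSketch, looksLikeRenegotiation) are hypothetical, and it assumes the check flags any successful unwrap that leaves the engine in a handshaking state. Only javax.net.ssl.SSLEngineResult from the JDK is used, and the byte counts are arbitrary.

```java
import javax.net.ssl.SSLEngineResult;
import javax.net.ssl.SSLEngineResult.HandshakeStatus;
import javax.net.ssl.SSLEngineResult.Status;

// Hypothetical illustration; not the actual SslTransportLayer code.
final class RenegotiationCheckSketch {

    // Assumption: an unwrap that succeeds (Status.OK) but reports that the
    // engine is handshaking again is treated as a renegotiation request.
    static boolean looksLikeRenegotiation(SSLEngineResult unwrapResult) {
        HandshakeStatus hs = unwrapResult.getHandshakeStatus();
        return unwrapResult.getStatus() == Status.OK
                && hs != HandshakeStatus.NOT_HANDSHAKING
                && hs != HandshakeStatus.FINISHED;
    }

    public static void main(String[] args) {
        // The state reported in this thread: NEED_WRAP together with OK.
        SSLEngineResult reported =
                new SSLEngineResult(Status.OK, HandshakeStatus.NEED_WRAP, 40, 0);
        // An ordinary application-data read for comparison.
        SSLEngineResult ordinary =
                new SSLEngineResult(Status.OK, HandshakeStatus.NOT_HANDSHAKING, 40, 40);

        System.out.println("NEED_WRAP + OK flagged: " + looksLikeRenegotiation(reported));
        System.out.println("NOT_HANDSHAKING + OK flagged: " + looksLikeRenegotiation(ordinary));
    }
}
```

Under this assumption the state seen in the log (NEED_WRAP with OK) is flagged, while an ordinary read is not.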



As a simple test, I turned off the renegotiation exception. With it disabled, I
no longer saw the disconnect messages in the log or the intermittent latency spikes.



Thank you,

Shylaja





ERROR [SslTransportLayer channelId=1 
key=channel=java.nio.channels.SocketChannel[connection-pending remote=/:9093], 
selector=sun.nio.ch.EPollSelectorImpl@29fc22ba, interestOps=8, 
readyOps=0] Renegotiation requested, but it is not supported, channelId 1, 
appReadBuffer pos 0, netReadBuffer pos 40, netWriteBuffer pos 147 
handshakeStatus NEED_WRAP (org.apache.kafka.common.network.SslTransportLayer)

[2021-10-05 21:03:40,042] INFO [ReplicaFetcher replicaId=0, leaderId=1, 
fetcherId=1] Error sending fetch request (sessionId=530771171, epoch=237174) to 
node 1: (org.apache.kafka.clients.FetchSessionHandler)

java.io.IOException: Connection to 1 was disconnected before the response was 
read

at 
org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)

at 
kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:110)

at 
kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:217)

at 
kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:317)

at 
kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:141)

at 
kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:140)

at scala.Option.foreach(Option.scala:437)

at 
kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:140)

at 
kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:123)

at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)

[2021-10-05 21:03:40,042] WARN [ReplicaFetcher replicaId=0, leaderId=1, 
fetcherId=1] Error in response for fetch request (type=FetchRequest, 
replicaId=0, maxWait=500, minBytes=1, maxBytes=10485760, 
fetchData={test_topic0-5=PartitionData(fetchOffset=267151, logStartOffset=0, 
maxBytes=1048576, currentLeaderEpoch=Optional[0], 
lastFetchedEpoch=Optional[0])}, isolationLevel=READ_UNCOMMITTED, toForget=, 
metadata=(sessionId=530771171, epoch=237174), rackId=) 
(kafka.server.ReplicaFetcherThread)

java.io.IOException: Connection to 1 was disconnected before the response was 
read

at 
org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)

at 
kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:110)

at 
kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:217)

at 
kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:317)

at 
kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:141)

at 
kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:140)

at scala.Option.foreach(Option.scala:437)

at 
kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:140)


[DISCUSS] Brokers disconnect intermittently with TLS1.3

2021-11-15 Thread Kokoori, Shylaja
Hi all,

Using TLS 1.3 (with JDK 11) is causing an intermittent increase in inter-broker
p99 latency, as mentioned by Yiming in KAFKA-9320. We tested this with Kafka 2.8.

The issue seems to be caused by a renegotiation exception being thrown by

read(ByteBuffer dst)

and

write(ByteBuffer src)

in
clients/src/main/java/org/apache/kafka/common/network/SslTransportLayer.java

This exception is causing the connection between the brokers to close before
the read/write is completed.

In our internal experiments we have seen the p99 latency stabilize when we 
remove this exception.

Given that TLS 1.3 does not support renegotiation, can I make the renegotiation
exception apply only to TLS 1.2?
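
To make the question concrete, here is a minimal sketch of what such a protocol-gated check could look like. It is an assumption about the shape of a possible change, not the actual SslTransportLayer code or a proposed patch: the class and method names are hypothetical, and it assumes the negotiated protocol is obtained from SSLEngine.getSession().getProtocol(), which reports "TLSv1.3" for TLS 1.3 sessions.

```java
import javax.net.ssl.SSLEngineResult;
import javax.net.ssl.SSLEngineResult.HandshakeStatus;
import javax.net.ssl.SSLEngineResult.Status;

// Hypothetical illustration of gating the renegotiation check by protocol.
final class ProtocolGatedRenegotiationSketch {

    static final String TLS13 = "TLSv1.3";

    // negotiatedProtocol is assumed to come from
    // sslEngine.getSession().getProtocol() in real code.
    static boolean shouldThrowRenegotiation(String negotiatedProtocol,
                                            SSLEngineResult result) {
        // TLS 1.3 has no renegotiation, so skip the check for it.
        if (TLS13.equals(negotiatedProtocol)) {
            return false;
        }
        // For earlier protocols, keep the assumed existing behaviour:
        // a successful operation that re-enters handshaking is rejected.
        HandshakeStatus hs = result.getHandshakeStatus();
        return result.getStatus() == Status.OK
                && hs != HandshakeStatus.NOT_HANDSHAKING
                && hs != HandshakeStatus.FINISHED;
    }

    public static void main(String[] args) {
        // Byte counts are arbitrary; only the two status fields matter here.
        SSLEngineResult needWrap =
                new SSLEngineResult(Status.OK, HandshakeStatus.NEED_WRAP, 40, 0);
        System.out.println("TLSv1.3: " + shouldThrowRenegotiation("TLSv1.3", needWrap));
        System.out.println("TLSv1.2: " + shouldThrowRenegotiation("TLSv1.2", needWrap));
    }
}
```

With this gating, the same engine result would still raise the exception on a TLS 1.2 connection but would be ignored on a TLS 1.3 connection.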

I have also created a ticket

Any feedback is welcome.

Thank you,

Shylaja