Hello, Thanks David to propose a new PR [1] for KIP-498 [2] to address KAFKA-4090 [3].
I find it interesting, I wonder how it stands w.r.t. avoiding inter-layer violation. [1] https://github.com/apache/kafka/pull/8066 [2] https://issues.apache.org/jira/browse/KAFKA-4090 [3] https://cwiki.apache.org/confluence/display/KAFKA/KIP-498%3A+Add+client-side+configuration+for+maximum+response+size+to+protect+against+OOM Le mar. 6 août 2019 à 15:37, Gokul Ramanan Subramanian <gokul24...@gmail.com> a écrit : > > Hi Alexandre. > > Thanks for this analysis. > > IMHO, there are 4 ways to ago about this: > > 1. We don't fix the bug directly but instead update the Kafka documentation > telling clients to configure themselves correctly - Silly but easy to > achieve. > 2. We adopt Stanislav's solution that fixes the problem - Easy to achieve, > potentially adds inflexibility in the future if we want to change handshake > protocol. However, changing the handshake protocol is going to be a > backwards incompatible change anyway with or without Stanislav's solution. > 3. We adopt Alexandre's solution - Easy to achieve, but has shortcomings > Alexandre has highlighted. > 4. We pivot KIP-498 to focus on standardizing the handshake protocol - Not > easy to achieve, since this will be a backwards incompatible change and > will require client and server redeployments in correct order. Further, > this can be a hard problem to solve given that various transport layer > protocols have different headers. In order for the "new handshake" protocol > to work, it would have to work in the same namespace as those headers, > which will require careful tuning of handshake constants. > > Any thoughts from the community on how we can proceed? > > Thanks. > > On Tue, Aug 6, 2019 at 12:41 PM Alexandre Dupriez < > alexandre.dupr...@gmail.com> wrote: > > > Hello, > > > > I wrote a small change [1] to make clients validate the size of messages > > received from a broker at the protocol-level. > > However I don't like the change. It does not really solve the problem which > > originally motivated the KIP - that is, protocol mismatch (plaintext to SSL > > endpoint). More specifically, few problems I can see are listed below. This > > is a non-exhaustive list. They also have been added to KIP-498 [2]. > > > > 1) Incorrect failure mode > > As was experimented and as can be seen as part of the integration tests, > > when an invalid size is detected and the exception InvalidReceiveException > > is thrown, consumers and producers keeps retrying until the poll timeout > > expires or the maximum number of retries is reached. This is incorrect if > > we consider the use case of protocol mismatch which motivated this change. > > Indeed, producers and consumers would need to fail fast as retries will > > only prolong the time to remediation from users/administrators. > > > > 2) Incomplete remediation > > The proposed change cannot provide an definite guarantee against OOM in all > > scenarios - for instance, it will still manifest if the maximum size is set > > to 100 MB and the JVM is under memory pressure and have less than 100 MB of > > allocatable memory. > > > > 3) Illegitimate message rejection > > Even worse: what if the property is incorrectly configured and prevent > > legitimate messages from reaching the client? > > > > 4) Unclear configuration parameter > > 4.a) The name max.response.size intends to mirror the existing > > max.request.size from the producer's configuration properties. However, > > max.request.size intends to check the size of producer records as provided > > by a client; while max.response.size is to check the size directly decoded > > from the network according to Kafka's binary protocol. > > 4.b) On the broker, the property socket.request.max.bytes is used to > > validate the size of messages received by the server. The new property > > serves the same purpose, which introduces duplicated semantic, even if one > > property is characterised with the keyword "request" and the other with > > "response", in both cases reflecting the perspective adopted from either a > > client or a server. > > > > Please let me know what you think. An alternative mitigation may be worth > > investigated for the detection of protocol mismatch in the client. > > > > [1] https://github.com/apache/kafka/pull/7160 > > [2] > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-498%3A+Add+client-side+configuration+for+maximum+response+size+to+protect+against+OOM > > > > Le jeu. 1 août 2019 à 09:42, Alexandre Dupriez < > > alexandre.dupr...@gmail.com> > > a écrit : > > > > > Thanks Gokul and Stanislav for your answers. > > > > > > I went through the PR 5940 [1]. Indeed Gokul, your reasoning echoes the > > > observations of Ismael about a potential inter-protocol layering > > violation. > > > > > > As you said Stanislav, the server-side SSL engine responds with an alert > > > with code 80 (internal_error) from what I saw when reproducing the OOM. > > > Since the Alert is generated below the application layer, I am not sure > > > what could be done on the broker to handle the scenario more gracefully. > > > Interestingly, the SSL engine emits the possibility of receiving a > > > plaintext message in debug logs [2]. > > > > > > The idea was indeed to perform a simple check on the message size decoded > > > in NetworkReceive against a configurable value, and throw > > > an InvalidReceiveException in a similar fashion as what happens on the > > > broker, where a non-unlimited maxSize can be provided. Basically, mimic > > the > > > behaviour on the broker. The default value for the maximal request size > > on > > > the broker is 100 MB which you are suggesting to use client-side. > > > > > > Adding a client configuration property for clients may be an overkill > > > here. What I am going to ask is naive but - is it theoretically possible > > > for the broker to legitimately send responses over 100 MB in size? > > > > > > Thanks, > > > Alexandre > > > > > > [1] https://github.com/apache/kafka/pull/5940 > > > [2] > > > > > > kafka-network-thread-0-ListenerName(SSL)-SSL-4, fatal error: 80: problem > > > unwrapping net record > > > > > > javax.net.ssl.SSLException: Unrecognized SSL message, plaintext > > connection? > > > > > > kafka-network-thread-0-ListenerName(SSL)-SSL-4, SEND TLSv1.2 ALERT: > > fatal, > > > description = internal_error > > > > > > kafka-network-thread-0-ListenerName(SSL)-SSL-4, WRITE: TLSv1.2 Alert, > > > length = 2 > > > > > > kafka-network-thread-0-ListenerName(SSL)-SSL-4, called closeOutbound() > > > > > > kafka-network-thread-0-ListenerName(SSL)-SSL-4, closeOutboundInternal() > > > > > > kafka-network-thread-0-ListenerName(SSL)-SSL-4, called closeInbound() > > > > > > kafka-network-thread-0-ListenerName(SSL)-SSL-4, fatal: engine already > > > closed. Rethrowing javax.net.ssl.SSLException: Inbound closed before > > > receiving peer's close_notify: possible truncation attack? > > > > > > [Raw write]: length = 7 > > > > > > 0000: 15 03 03 00 02 02 50 ......P > > > > > > > > > Le jeu. 1 août 2019 à 08:50, Stanislav Kozlovski <stanis...@confluent.io > > > > > > a écrit : > > > > > >> Hey Alexandre, thanks for the KIP! > > >> > > >> I had personally hit the same problem and found it very annoying. > > >> Have you considered detecting such a scenario in the client and simply > > >> throwing an exception, rather than allocating any memory? > > >> I have an open PR that does just that - > > >> https://github.com/apache/kafka/pull/5940 > > >> <https://github.com/apache/kafka/pull/5940/files> > > >> > > >> Best, > > >> Stanislav > > >> > > >> On Wed, Jul 31, 2019 at 10:35 AM Gokul Ramanan Subramanian < > > >> gokul24...@gmail.com> wrote: > > >> > > >> > Hi Alex. > > >> > > > >> > A rejected alternative is to try to do SSL handshake from the > > plaintext > > >> > transport layer. This couples various transport layers and is > > >> inflexible to > > >> > adding new transport layers in the future. Further, if the plaintext > > >> > transport layer does SSL handshake, it will always get an error, > > >> > irrespective of whether or not the peer supports SSL. Therefore, the > > >> > plaintext transport layer would have to determine if the peer uses SSL > > >> or > > >> > not based on the type of error it gets back from the SSL handshake. > > This > > >> > fits right into the general anti-pattern of using exceptions as > > control > > >> > flow mechanisms. > > >> > > > >> > Another rejected alternative would be to have a single byte at the > > >> > transport layer represent the security protocol in use. This byte > > would > > >> > have to be present in every single message between clients and brokers > > >> and > > >> > between brokers and brokers. This change is backwards-incompatible and > > >> adds > > >> > overhead to Kafka. It is likely never going to get accepted by the > > >> > community. > > >> > > > >> > In the absence of a fool-proof way to do a handhskake across all > > >> security > > >> > protocols (plaintext, SSL, other future ones), I think that the > > proposed > > >> > KIP provides a good solution. Therefore, you have my +1. (Not sure if > > >> my +1 > > >> > counts, since I am a Kafka noob). > > >> > > > >> > Thanks. > > >> > Gokul Ramanan Subramanian > > >> > Senior SDE, Amazon AWS > > >> > > > >> > On 28/Jul/19 05:43:19PM +0100, Alexandre Dupriez wrote: > > >> > Hello, > > >> > > > >> > I have created the KIP-498 > > >> > < > > >> > > > >> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-498%3A+Add+client-side+configuration+for+maximum+response+size+to+protect+against+OOM > > >> > > > > >> > to > > >> > propose a minor addition to the producer configuration properties, in > > >> > order > > >> > to protect against OOM when the response message decoded by a client > > >> > indicates an "abnormally high" size to be allocated. > > >> > > > >> > This follows this discussion thread > > >> > <https://www.mail-archive.com/dev@kafka.apache.org/msg55658.html> > > and > > >> is > > >> > tracked by KAFKA-4090 < > > >> https://issues.apache.org/jira/browse/KAFKA-4090>. > > >> > > > >> > Please let me know what you think. I created this KIP even though the > > >> > change seems minimal because it affects client configuration, which > > is > > >> > mentioned as a type of change eligible for a KIP on this main wiki > > page > > >> > < > > >> > > > >> > > https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals > > >> > > > > >> > . > > >> > > > >> > Thanks, > > >> > Alexandre > > >> > > > >> > > > > > > > > -- > GOKUL R. SUBRAMANIAN, > Software Engineer