Hi Alieh, Yes, I think you understand my intent for the prepare() method.
Thanks, Greg On Mon, Jul 22, 2024 at 2:54 AM Alieh Saeedi <asae...@confluent.io.invalid> wrote: > Hi Greg, > > > I appreciate your concerns and comprehensive answer. > > > I am not sure whether I fully understood what you meant or not. You mean, > at the end, the user can go for one of the following scenarios: Either > > 1) `beginTxn()` and `send(record)` and `commitTxn()` or > > 2) `beginTxn()` and `prepare(record)` and `send(prepared_record)` and > `commitTxn()` ? > > > Of course, the `send` in scenario 1 is different from the one in scenario > 2, since a part of the second one 's job has been done during > `prepare()`ing. > > > Cheers, > > Alieh > > On Sat, Jul 20, 2024 at 1:20 AM Greg Harris <greg.har...@aiven.io.invalid> > wrote: > > > Hi Artem and Matthias, > > > > > On the other hand, the effort to prove that > > > keeping all records in memory won't break some scenarios (and generally > > > breaking one is enough to cause a lot of pain) seems to be > significantly > > > higher than to prove that setting some flag in some API has pretty > much 0 > > > chance of regression > > > > > in the end, why buffer records twice? > > > > > This way we don't > > > ignore the error, we're just changing the method they are delivered. > > > > > Very clean semantics > > > which should also address the concern of "non-atomic tx" > > > > I feel like my concerns are being minimized instead of being addressed > > in this discussion, and if that's because I'm not expressing them > clearly, > > I apologize. > > > > Many users come to Kafka with prior expectations, especially when we use > > industry-standard terminology like 'Exactly Once Semantics", > > "Transactions", "Commit", "Abort". Of course Kafka isn't an > ACID-compliant > > database, but users will evaluate, discuss, and develop applications with > > Kafka through the lens of the ACID principles, because that is the > > framework most commonly applied to transactional semantics. > > The original design of KIP-98 [1] explicitly mentions atomic commits > (with > > the same meaning as the A in ACID) as the primary abstraction being added > > (reproduced here): > > > > > At the core, transactional guarantees enable applications to produce to > > multiple TopicPartitions atomically, ie. all writes to these > > TopicPartitions will succeed or fail as a unit. > > > Further, since consumer progress is recorded as a write to the offsets > > topic, the above capability is leveraged to enable applications to batch > > consumed and produced messages into a single atomic unit, ie. a set of > > messages may be considered consumed only if the entire > > ‘consume-transform-produce’ executed in its entirety. > > > > I think it's important to say that to a user, "writes" really means > "send() > > and commitOffsets() calls", not literal produce requests to Kafka > brokers, > > and "consume-transform-produce" really means "poll(), transform, send()". > > This is because to a user, the implementation within poll() and send() > and > > the broker are none of their concern, and are intended to be within the > > abstraction. > > When I say that this feature is a non-atomic commit, I mean that this > > feature does not fit the above description, and breaks the transaction > > abstraction in a meaningful way. No longer do all writes succeed or fail > as > > a unit, some failures are permitted to drop data. No longer must a > > consume-transform-produce cycle be executed in its entirety, some parts > may > > be left incomplete. > > This means that this method will be difficult to define ("which > exceptions > > are covered?"), difficult to document ("how do we explain > > 'not-really-atomic commits' clearly and unambiguously to a potential > > user?"), and difficult to compose ("if someone turns this option on, how > > does that affect delivery guarantees and opportunities for bugs in > > upper layers?"). > > Users currently rely heavily on analogies to other database systems to > make > > sense of Kafka's transactions, and we need to use that to our benefit, > > rather than designing in spite of it being true. > > > > However this atomicity guarantee isn't always desirable, as evidenced by > > the original bug report [2]. If you're interacting with a website form > for > > example, and a database transaction fails because one of your strings is > > oversize, you don't need to re-input all of your form responses from > > scratch, as there is an application layer/browser in-between to preserve > > the state and retry the transaction. > > And while you could make a convenience/performance/etc argument in that > > situation ("The database should truncate/null-out the oversize string") > and > > modern databases often have very expressive DML that would permit such a > > behavior (try-catch, etc), the End-to-End arguments [3] make me believe > > that is a bad design and should be discouraged. > > To that end, I was suggesting ways to push this farther and farther up > the > > stack, such as performing record size estimation. This doesn't mean that > it > > can't be added at a low level of abstraction, just that we need to make > > sure to exhaust all other alternatives, and justify it with a performance > > benefit. > > > > I was holding off on discussing the literal design until you provided > > concrete performance justification, but to progress the discussion while > > i'm waiting for that, I can give my thoughts: > > > > I don't think an overloaded send() method is appropriate given that this > > appears to be a niche use-case, and the send() method itself is probably > > the single most important method in the Clients library. The KIP-98 > design > > was a much more substantial change to the Producer than this KIP, and it > > found a way to preserve the original type signature (but added an > > exception). > > Users picking up the Producer for the first time may see this additional > > method, and may spend time trying to understand whether it is something > > suitable for their use-case. In the best case, they ignore it and use the > > other two signatures. But it is also possible that they will use it > without > > understanding it, and have unexpected data loss. Overall, this feels > like a > > net negative to the producer user-base. > > Also boolean, enum, vararg, flags, etc don't follow the current trend of > > the AdminClient passing Options DTOs for modifying individual API calls, > > which is a much more extensible design. > > > > If there was a way of preparing a record before calling send() that did > > some or all of the pre-send validation (serialization, size checking, > maybe > > topic-partition existence existence, authorization, etc) that could be a > > reasonable way to emit these sorts of errors in a way that makes it clear > > that sending hasn't started and the record is not part of the transaction > > yet. I'm imagining something like: > > > > ``` > > public abstract class PreparedRecord<K, V> extends ProducerRecord<K, V> > { } > > // Actual instantiated class and methods could be an implementation > detail. > > interface Producer { > > Optional<PreparedRecord<K, V>> prepare(ProducerRecord<K, V> record, > > PrepareOptions options) throws SerializationException, > > RecordTooLargeException; > > } > > > > // Application code: > > Optional<PreparedRecord<byte[], byte[]>> prepared = Optional.empty(); > > try { > > prepared = producer.prepare(record, null); > > } catch (Exception e) { > > if (critical(e)) { > > producer.abortTransaction(); > > return; > > } > > } > > if (prepared.isPresent()) { > > producer.send(prepared.get()); // errors will be propagated via > > commitTransaction() and also abort the transaction. > > } > > ``` > > > > Semantically then, it would make sense that the data which has been > > "prepared to send" but has not been "sent" is intentionally left out of > the > > batch by the application control-flow, not an option passed into the > > producer changing the control-flow within the Producer. If passed a > > prepared record, the send method can then be responsible only for waiting > > for sufficient buffer capacity, reusing the work done by the prepare() > > method. > > This prepare method could also enable other use-cases, like parallel > > serialization, exception-less handling, or more advanced > buffering/balking > > behavior ("refuse to prepare records that can't be sent without > blocking"). > > And because the synchronous processing is made explicit, it's much easier > > to communicate to users which exceptions are covered and can be prevented > > from causing transaction aborts. > > If someone uses the method, or doesn't use it, they aren't put at risk of > > compromising their delivery guarantees unexpectedly. > > > > Thanks, > > Greg > > > > [1] > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging > > [2] https://issues.apache.org/jira/browse/KAFKA-15259 > > [3] https://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf > > > > On Fri, Jul 19, 2024 at 12:39 PM Matthias J. Sax <mj...@apache.org> > wrote: > > > > > For catching client side errors this would work IMHO. I am ok with > this. > > > > > > We throw before we add the record to the batch. Very clean semantics > > > which should also address the concern of "non-atomic tx"... The > > > exception clearly indicates that the record was not added to the TX, > and > > > users can react to it accordingly. > > > > > > We did discuss this idea previously, but did not have a good proposal > to > > > make it backward compatible. The newly proposed overload would address > > > this issue of backward compatibility. > > > > > > Of course, it might not make it easily extensible in the future for > > > broker side errors, but it's unclear anyway right now, if we would even > > > get to a solution for broker side errors or not -- so maybe it's ok to > > > accept this and drop/ignore the broker side error question for now. > > > > > > > > > > > > A small follow up thought/question: instead of using a boolean, would > we > > > actually want to make it a var-arg enum to allow users to enable this > > > for certain errors explicitly and individually? Beside the added > > > flexibility and fine grain control, a var-arg enum would also make the > > > API nicer/cleaner IMHO compare to a boolean. > > > > > > For convenience, this enum could have an additional `ALL` option (and > we > > > would call out that if `ALL` is used, new error types might be added in > > > future release making the code less safe/robust -- ie, use at your own > > > risk only) > > > > > > This way, we also explicitly document what exception might be thrown in > > > the KIP, as we would add an enum for each error type explicitly, and > > > also make if future proof for new error types we want to cover -- each > > > addition would require a KIP to extend the enum. > > > > > > > > > > > > -Matthias > > > > > > > > > On 7/18/24 10:33 PM, Artem Livshits wrote: > > > > Hey folks, > > > > > > > > Hopefully not to make this KIP go for another spin :-), but I thought > > of > > > a > > > > modification that might actually address safety concerns over using > > flags > > > > to ignore a vaguely specified class of errors. > > > > > > > > What if we had the following overload of .send method: > > > > > > > > void send(ProducerRecord record, Callback callback, boolean > > > > throwImmediately) > > > > > > > > and if throwImmediately=false, then we behave the same way as now > > (return > > > > errors via Future and poison transaction) and if > throwImmediately=true > > > then > > > > we just throw errors immediately from the send function. This way we > > > don't > > > > ignore the error, we're just changing the method they are delivered. > > > Then > > > > KStreams can catch the error for send(record, callback, true) and do > > > > whatever it needs to do. > > > > > > > > -Artem > > > > > > > > > > > > On Mon, Jul 15, 2024 at 4:30 PM Greg Harris > > <greg.har...@aiven.io.invalid > > > > > > > > wrote: > > > > > > > >> Matthias, > > > >> > > > >> Thank you for rejecting my suggested alternatives. Your responses > are > > > the > > > >> sorts of things I expected to see summarized in the text of the KIP. > > > >> > > > >> I agree with most of your rejections, except this one: > > > >> > > > >>> "Estimation" is not sufficient, but we would need to know it > exactly. > > > >>> And that's an impl detail, given that the message format could > change > > > >>> and we could add new internal fields increasing the message size. > > > >> > > > >> An estimate is certainly going to have an error. But an estimate > > > shouldn't > > > >> be treated as exact anyway, there should be an error bound, or > "safety > > > >> factor" used when interpreting it. For example, if the broker side > > > limit is > > > >> 1MB, and an estimate could be wrong by ~10%, then computing an > > estimate > > > and > > > >> dropping records >900kb should be sufficient to prevent RTLEs. > > > >> This is the sort of estimation that I would expect application > > > developers > > > >> could implement, without knowing the exact serialization and > protocol > > > >> overhead. This could prevent user-originated oversize records from > > > making > > > >> it to the producer. > > > >> > > > >> Thanks, > > > >> Greg > > > >> > > > >> > > > >> On Mon, Jul 15, 2024 at 4:08 PM Matthias J. Sax <mj...@apache.org> > > > wrote: > > > >> > > > >>> I agree with Alieh and Artem -- in the end, why buffer records > twice? > > > We > > > >>> effectively want to allow to push some error handling (which I btw > > > >>> consider "business logic") into the producer. IMHO, there is > nothing > > > >>> wrong with it. Dropping a poison pill record is no really a > violation > > > of > > > >>> atomicity from my POV, but a business logic decision to not > include a > > > >>> record in a transaction -- the proposed API just makes it much > > simpler > > > >>> to achieve this business logic goal. > > > >>> > > > >>> > > > >>> > > > >>> For memory size estimation, throughput or message size is actually > > not > > > >>> relevant, right? We would need to look at producer buffer size, ie, > > > >>> `batch.size`, `max.in.flight.request.per.connection` and > guesstimate > > > the > > > >>> number of connections there might be? At least for KS, we don't > need > > to > > > >>> buffer everything until commit, but only until we get a successful > > > "ack" > > > >>> back. > > > >>> > > > >>> Note that KS application not only need to write to (a single) user > > > >>> result topic, but multiple output topics, as well as repartition > and > > > >>> changelog topics, across all tasks assigned to a thread (ie, > > producer), > > > >>> which can easily be 10 tasks or more. If we assume topics with 30 > > > >>> partitions (topics with 50 or more partitions are not uncommon > > either), > > > >>> and a producer who must write to 10 different topics, the number of > > > >>> required connections is very quickly very high, and thus the > required > > > >>> "application buffer space" would be significant. > > > >>> > > > >>> > > > >>> > > > >>> Your others ideas seems not to be viable alternatives: > > > >>> > > > >>>> Streams users that specifically want to drop oversize records can > > > >>>> estimate the size of their data and drop records which are too > > > >>>> large, enforcing their own limits that are lower than the Kafka > > > limits. > > > >>> > > > >>> "Estimation" is not sufficient, but we would need to know it > exactly. > > > >>> And that's an impl detail, given that the message format could > change > > > >>> and we could add new internal fields increasing the message size. > The > > > >>> idea to add some `producer.serializedRecordSize()` helper method > was > > > >>> discussed, but it's a very ugly API and clumsy to use -- also, the > > user > > > >>> code would need to know the producer config which it might not have > > > >>> access to (as it might get passed in from some config file; and it > > > might > > > >>> also be changed). > > > >>> > > > >>> Some other alternative we also discussed was, to let `send()` throw > > an > > > >>> exception for a "record too large" case directly. However, this > > > solution > > > >>> raises backward compatibly concerns, and it might also not help us > to > > > >>> extend the solution in the future (eg, tackle broker side errors). > So > > > we > > > >>> discarded this idea. > > > >>> > > > >>> > > > >>> > > > >>>> Streams users that want CONTINUE semantics can use at_least_once > > > >>>> semantics > > > >>> > > > >>> Not really. EOS is mainly about not having duplicates in the > result, > > > but > > > >>> at-least-once cannot provide this guarantee. (Even if I repeat my > > self: > > > >>> but dropping a poison pill record based on a business logic > decision > > is > > > >>> not data loss, but effectively a business logic filter...) > > > >>> > > > >>> > > > >>> > > > >>>> Streams itself can store record hashes/coordinates and fast rewind > > to > > > >>>> the end of the last transaction, recomputing data rather than > > storing > > > >> it. > > > >>> > > > >>> Given the very complex nature of topologies, with joins, > > aggregations, > > > >>> flatmaps etc, this is a 100x more complex solution and not viable > in > > > >>> practice. > > > >>> > > > >>> > > > >>> > > > >>>> Streams can define exactly_once + CONTINUE semantics to permit the > > > >> whole > > > >>>> transaction to be dropped, because it would allow the next batch > to > > be > > > >>>> started processing. > > > >>> > > > >>> Would this not be much worse? I have a single poison pill record > and > > > >>> would need to drop a full tx (this could be tens of thousands of > > > >>> records...). Also, given that KS write into changelog topic in the > > same > > > >>> TX, this could break the whole application. > > > >>> > > > >>> > > > >>> > > > >>>> Streams can emit records with both a transactional and > > > >> non-transactional > > > >>>> producer if some records are not critical-path > > > >>> > > > >>> We (1) already have a "too many connections" problem with KS so > using > > > >>> move clients is something we try to avoid (and we actually hope to > > > >>> reduce the number of client and connection mid to long term), (2) > > this > > > >>> would be very hard to express at the API level to the user, and (3) > > it > > > >>> would provide very weird semantics. > > > >>> > > > >>> > > > >>> > > > >>>> they should optimize for smaller transactions, > > > >>> > > > >>> IMHO, this would not work in practice because transaction have a > high > > > >>> overhead and commit-interval is used to tradeoff throughput vs > > > >>> end-to-end latency. Given certain throughput requirement, it would > > not > > > >>> be possible to just use a lower commit interval to reduce memory > > > >>> requirements. > > > >>> > > > >>> > > > >>> > > > >>> -Matthias > > > >>> > > > >>> > > > >>> > > > >>> > > > >>> On 7/15/24 2:25 PM, Artem Livshits wrote: > > > >>>> Hi Greg, > > > >>>> > > > >>>>> This makes me think that this IGNORE_SEND_ERRORS covers an > > arbitrary > > > >> set > > > >>>> of error conditions that may be expanded in the future, possibly > to > > > >> cover > > > >>>> the broker side RecordTooLargeException. > > > >>>> > > > >>>> I don't think it contradicts what I said (the keyword here is "in > > the > > > >>>> future") -- with the current functionality, the correct way to > > handle > > > >>> RTLE > > > >>>> is by only letting the client ignore client-originated RTLE (this > > can > > > >> be > > > >>>> easily implemented on the client side). In the future, we can > > improve > > > >> on > > > >>>> that by making the broker return a different exception for > > > >>> batch-too-large > > > >>>> case, then the producer would be able to return broker side > > exceptions > > > >> as > > > >>>> well (and if the application chooses to ignore it -- it will be > able > > > >> to, > > > >>>> but it would be an explicit choice rather than ignoring it by > > > mistake), > > > >>> in > > > >>>> this case the producer client would encapsulate backward > > compatibility > > > >>>> logic when it connects to older brokers to make sure the the > > > >> application > > > >>>> doesn't accidentally gets RTLE originated by the old broker. This > > > >>>> functionality is obviously more involved and we'll need to see if > > > going > > > >>> all > > > >>>> the way is justified, but the partial client-only solution doesn't > > > >> close > > > >>>> the door. > > > >>>> > > > >>>> So one way to look at the current situation is the following: > > > >>>> > > > >>>> 1. We can do a low effort partial solution to solve a real > existing > > > >>>> problem. We can easily prove that it would do exactly what it > needs > > > to > > > >>> do > > > >>>> with minimal risk of regression. > > > >>>> 2. We have a path to a more comprehensive solution, so if we > justify > > > >> the > > > >>>> effort required for that, we can get there. > > > >>>> > > > >>>> BTW, as a side note (I think a saw a question in the thread), we > do > > > try > > > >>> to > > > >>>> introduce error categories here > > > >>>> > > > >>> > > > >> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1050%3A+Consistent+error+handling+for+Transactions > > > >>>> so eventually we may have a better classification for the errors. > > > >>>> > > > >>>>> "if a streams producer is producing 1MB/s, and the commit > interval > > is > > > >> 1 > > > >>>> hour, I expect 3600MB of additional heap needed ... > > > >>>> > > > >>>> Agree, that would be ideal. On the other hand, the effort to > prove > > > >> that > > > >>>> keeping all records in memory won't break some scenarios (and > > > generally > > > >>>> breaking one is enough to cause a lot of pain) seems to be > > > >> significantly > > > >>>> higher than to prove that setting some flag in some API has pretty > > > >> much 0 > > > >>>> chance of regression (we basically have a flag to say "unfix > > > >> KAFKA-9279" > > > >>> so > > > >>>> we're getting to fairly "known good" state). I'll let KStream > folks > > > >>>> comment on this one (and we still need to solve the problem of > > > >> accidental > > > >>>> handling of RTLE originated from broker, so some KIP would be > > required > > > >> to > > > >>>> somehow help to differentiate those). > > > >>>> > > > >>>> -Artem > > > >>>> > > > >>>> On Mon, Jul 15, 2024 at 1:31 PM Greg Harris > > > >> <greg.har...@aiven.io.invalid > > > >>>> > > > >>>> wrote: > > > >>>> > > > >>>>> Hi Artem, > > > >>>>> > > > >>>>> Thank you for clarifying as I'm joining the conversation late and > > may > > > >>> have > > > >>>>> some misconceptions. > > > >>>>> > > > >>>>>> Because of this, a more "complete" solution that > > > >>>>>> allows ignoring RecordTooLargeException regardless of its origin > > is > > > >>>>>> actually incorrect, while a "partial" solution that allows > > ignoring > > > >>>>>> RecordTooLargeException only originating in client code > > accomplishes > > > >>> the > > > >>>>>> required functionality. > > > >>>>> > > > >>>>> This is not how I understood this feature. Above Matthias said > the > > > >>>>> following: > > > >>>>> > > > >>>>>> We can do > > > >>>>>> follow up KIP for other errors on an on-demand basis and > > fix-forward > > > >> / > > > >>>>>> enlarge the scope successively. > > > >>>>> > > > >>>>> This makes me think that this IGNORE_SEND_ERRORS covers an > > arbitrary > > > >>> set of > > > >>>>> error conditions that may be expanded in the future, possibly to > > > cover > > > >>> the > > > >>>>> broker side RecordTooLargeException. > > > >>>>> > > > >>>>>> Obviously, we could solve this problem by changing logic in the > > > >>>>>> broker to return a different error when the batch is too large, > > but > > > >>> right > > > >>>>>> now this is not the case > > > >>>>> > > > >>>>> If the broker/wire protocol isn't ready for these errors to be > > > >>> propagated, > > > >>>>> then I don't think we're ready to add this API. It's going to be > > > >>>>> under-generalized, and there's a decent chance that we're going > to > > > >>> regret > > > >>>>> the design choices in the future. And users that expect it to be > > > fully > > > >>>>> generalized are going to be disappointed when they don't read the > > > fine > > > >>>>> print and still get faulted by non-covered errors. > > > >>>>> > > > >>>>>> AL2. In a high performance system, "just an optimization" can > be > > a > > > >>>>>> functional requirement ... > > > >>>>>> I just wanted to make the point that we shouldn't necessarily > > > >> dismiss > > > >>>>>> API changes that allow for optimizations. > > > >>>>> > > > >>>>> My earlier statement didn't dismiss this feature as "just an > > > >>> optimization", > > > >>>>> actually the opposite. I said that performance could be a > > > >> justification, > > > >>>>> but only if it is quantified and stated explicitly. We shouldn't > be > > > >>> voting > > > >>>>> on hand-wavy optimizations, we should be voting on things that > are > > > >>>>> quantifiable. > > > >>>>> For example an analysis like the following would facilitate > further > > > >>>>> discussion: "if a streams producer is producing 1MB/s, and the > > commit > > > >>>>> interval is 1 hour, I expect 3600MB of additional heap needed per > > > >>>>> producer". We can then discuss whether we expect higher or lower > > > >>>>> throughput, commit intervals, or heap usage to determine what the > > > >>> operating > > > >>>>> envelope of this feature could be. > > > >>>>> If there are a substantial number of users that have high > > throughput, > > > >>> long > > > >>>>> commit intervals, _and_ RTLEs, then this feature could make > sense. > > If > > > >>> not, > > > >>>>> then the downsides of this feature (complication of the API, > > > >>>>> under-specification of the error coverage, etc) look unjustified. > > In > > > >>> fact, > > > >>>>> if the number of users regularly encountering RTLEs is > sufficiently > > > >>> small, > > > >>>>> I would strongly advocate for an application-specific workaround > > > >>> instead of > > > >>>>> trying to fix this in Streams, or make memory buffering an > optional > > > >>> feature > > > >>>>> in streams. > > > >>>>> > > > >>>>> Thanks, > > > >>>>> Greg > > > >>>>> > > > >>>>> On Mon, Jul 15, 2024 at 1:29 PM Greg Harris < > greg.har...@aiven.io> > > > >>> wrote: > > > >>>>> > > > >>>>>> Hi Alieh, > > > >>>>>> > > > >>>>>> Thanks for your response. > > > >>>>>> > > > >>>>>>> what does a user do > > > >>>>>>> after a transaction is failed due to a `too-large-record > > > `exception? > > > >>>>> They > > > >>>>>>> will submit the same batch without the problematic record > again. > > > >>>>>> > > > >>>>>> If they re-submit the same record, they are indicating that this > > > >> record > > > >>>>> is > > > >>>>>> an integral part of the transaction, and the transaction should > > only > > > >> be > > > >>>>>> committed with it present. If the record isn't integral to the > > > >>>>> transaction, > > > >>>>>> they shouldn't submit it as part of the transaction. > > > >>>>>> > > > >>>>>>> Regarding your solution to solve the issue application-side: I > > am > > > a > > > >>>>>>> bit hesitant to keep all sent records in memory since I think > > > >>> buffering > > > >>>>>>> records twice (both in Streams and Producer) would not be an > > > >> efficient > > > >>>>>>> solution. > > > >>>>>> > > > >>>>>> I understand your hesitation, and this touches on the > > "performance" > > > >>>>> caveat > > > >>>>>> of the end-to-end arguments in system design. There are no > perfect > > > >>>>> designs, > > > >>>>>> and some API cleanliness may be sacrificed in favor of more > > > >> performant > > > >>>>>> solutions. You would need to make a concrete and convincing > > argument > > > >>> that > > > >>>>>> the performance of this solution would be better than every > > > >>> alternative. > > > >>>>> To > > > >>>>>> that end, I would recommend that you add more to the "Rejected > > > >>>>>> Alternatives" section, as that is going to carry this proposal. > > > >>>>>> Some alternatives that I can think of, but which aren't > > necessarily > > > >>>>> better: > > > >>>>>> 1. Streams users that specifically want to drop oversize records > > can > > > >>>>>> estimate the size of their data and drop records which are too > > > >>>>>> large, enforcing their own limits that are lower than the Kafka > > > >> limits. > > > >>>>>> 2. Streams users that want CONTINUE semantics can use > > at_least_once > > > >>>>>> semantics > > > >>>>>> 3. Streams itself can store record hashes/coordinates and fast > > > rewind > > > >>> to > > > >>>>>> the end of the last transaction, recomputing data rather than > > > storing > > > >>> it. > > > >>>>>> 4. Streams can define exactly_once + CONTINUE semantics to > permit > > > the > > > >>>>>> whole transaction to be dropped, because it would allow the next > > > >> batch > > > >>> to > > > >>>>>> be started processing. > > > >>>>>> 5. Streams can emit records with both a transactional and > > > >>>>>> non-transactional producer if some records are not critical-path > > > >>>>>> > > > >>>>>> To generalize this point: Suppose an application tries to > minimize > > > >>>>> storage > > > >>>>>> costs by having only one party responsible for a piece of data > at > > a > > > >>> time. > > > >>>>>> They initially have the data, call send(), and want to know the > > > >>> earliest > > > >>>>>> time they can forget the data and transfer the responsibility to > > > >> Kafka. > > > >>>>>> With a non-transactional producer, they are responsible for the > > data > > > >>>>> until > > > >>>>>> the send() callback has succeeded. With a transactional > producer, > > > >> they > > > >>>>> are > > > >>>>>> responsible for the data until commitTransaction() has > succeeded. > > > >>>>>> With this proposed change that makes the producer tolerate > > > >>>>>> too-large-exceptions, applications are still responsible for > > storing > > > >>>>> their > > > >>>>>> data until commitTransaction() has succeeded, because > > > >>> abortTransaction() > > > >>>>>> could have also been called, or the producer could have been > > fenced, > > > >> or > > > >>>>> any > > > >>>>>> number of other failures could have occurred. This feature does > > not > > > >>>>> enable > > > >>>>>> Streams to drop responsibility earlier, it carves out a specific > > > >>>>> situation > > > >>>>>> in which it doesn't have to rewind processing, which is a > > > performance > > > >>>>>> concern. > > > >>>>>> > > > >>>>>> For Streams and the general case, if an application is trying to > > > >>> optimize > > > >>>>>> storage costs, they should optimize for smaller transactions, > > > because > > > >>>>> this > > > >>>>>> both lowers the bound on record re-delivery and lowers the > > > likelihood > > > >>> of > > > >>>>> a > > > >>>>>> bad record being included in any individual transaction. > > > >>>>>> > > > >>>>>> Thanks, > > > >>>>>> Greg > > > >>>>>> > > > >>>>>> On Mon, Jul 15, 2024 at 12:35 PM Artem Livshits > > > >>>>>> <alivsh...@confluent.io.invalid> wrote: > > > >>>>>> > > > >>>>>>> Hi Greg, > > > >>>>>>> > > > >>>>>>> What you say makes a lot of sense. I just wanted to clarify a > > > >> couple > > > >>> of > > > >>>>>>> subtle points. > > > >>>>>>> > > > >>>>>>> AL1. There is a functional reason to handle errors that happen > on > > > >> send > > > >>>>>>> (oginate in the producer logic in the client) vs. errors that > are > > > >>>>> returned > > > >>>>>>> from the broker. The problem is that RecordTooLargeException > is > > > >>>>> returned > > > >>>>>>> in two cases: (1) the producer logic on the client checks that > > > >> record > > > >>> is > > > >>>>>>> too large and throws the exception before doing anything with > > this > > > >> -- > > > >>>>> this > > > >>>>>>> is very "clean" situation with one specific record being marked > > as > > > >>>>> "poison > > > >>>>>>> pill" and rejected; (2) the broker throws the same error if the > > > >> batch > > > >>> is > > > >>>>>>> too large -- the batch may include multiple records and none of > > > them > > > >>>>> would > > > >>>>>>> necessarily be a "poison pill" record, it's just a random > > > >>>>> misconfiguration > > > >>>>>>> of client vs. broker. Because of this, a more "complete" > > solution > > > >>> that > > > >>>>>>> allows ignoring RecordTooLargeException regardless of its > origin > > is > > > >>>>>>> actually incorrect, while a "partial" solution that allows > > ignoring > > > >>>>>>> RecordTooLargeException only originating in client code > > > accomplishes > > > >>> the > > > >>>>>>> required functionality. This is an important nuance and should > > be > > > >>> added > > > >>>>>>> to > > > >>>>>>> the KIP. Obviously, we could solve this problem by changing > > logic > > > >> in > > > >>>>> the > > > >>>>>>> broker to return a different error when the batch is too large, > > but > > > >>>>> right > > > >>>>>>> now this is not the case (and to have the correct error > handling > > > >> we'd > > > >>>>> need > > > >>>>>>> to know the version of the broker so we can only drop the > records > > > if > > > >>> the > > > >>>>>>> error is returned from a broker that knows to return a > different > > > >>> error). > > > >>>>>>> > > > >>>>>>> AL2. In a high performance system, "just an optimization" can > > be a > > > >>>>>>> functional requirement -- if a solution impacts memory or > > > >>> computational > > > >>>>>>> complexity (in the sense of bigO notation) on the main code > path > > I > > > >> can > > > >>>>>>> justify changing APIs to avoid such an impact. I'll let > KStream > > > >> folks > > > >>>>>>> comment on whether an implementation that requires storing > > records > > > >> in > > > >>>>>>> memory actually violates the computational complexity on the > main > > > >> code > > > >>>>>>> path, I just wanted to make the point that we shouldn't > > necessarily > > > >>>>>>> dismiss > > > >>>>>>> API changes that allow for optimizations. > > > >>>>>>> > > > >>>>>>> -Artem > > > >>>>>>> > > > >>>>>>> On Fri, Jul 12, 2024 at 1:07 PM Greg Harris > > > >>>>> <greg.har...@aiven.io.invalid > > > >>>>>>>> > > > >>>>>>> wrote: > > > >>>>>>> > > > >>>>>>>> Hi all, > > > >>>>>>>> > > > >>>>>>>> Alieh, thanks for the KIP! And everyone else, thanks for the > > > robust > > > >>>>>>>> discussion. > > > >>>>>>>> > > > >>>>>>>> I understand that there are situations in which users desire > > that > > > >> the > > > >>>>>>>> pipeline "just keep working" and skip errors. However, I > > question > > > >>>>>>> whether > > > >>>>>>>> it is appropriate to support/encourage this behavior via > > inclusion > > > >> in > > > >>>>>>> the > > > >>>>>>>> Producer API. > > > >>>>>>>> This feature is essentially a "non-atomic transaction", as it > > > >> allows > > > >>>>>>>> commits in which not all records passed to send() ultimately > get > > > >>>>>>> committed. > > > >>>>>>>> As atomicity is one of the most important semantics associated > > > with > > > >>>>>>>> transactions, I question whether there are users other than > > > Streams > > > >>>>> that > > > >>>>>>>> would choose non-atomic transactions over a > > traditional/idempotent > > > >>>>>>>> producer. > > > >>>>>>>> Some cursory research shows that non-atomic transactions may > be > > > >>>>> present > > > >>>>>>> in > > > >>>>>>>> other databases, but is actively discouraged due to the > > complexity > > > >>>>> they > > > >>>>>>> add > > > >>>>>>>> to error-handling. [1] > > > >>>>>>>> > > > >>>>>>>> I'd like to invoke the End-to-End Arguments in System Design > [2] > > > >>> here, > > > >>>>>>> and > > > >>>>>>>> recommend that this behavior may be present in Streams, but > > should > > > >>> not > > > >>>>>>> be > > > >>>>>>>> in the Producer. > > > >>>>>>>> 1. Dropping records that cause errors is already expressible > via > > > >> the > > > >>>>>>>> current Producer API. You can store the records in-memory > after > > > >>>>> calling > > > >>>>>>>> send(), wait for a successful no-error flush() before calling > > > >>>>>>>> commitTransaction() and allowing the record to be garbage > > > >> collected. > > > >>>>> If > > > >>>>>>>> errors occur, abortTransaction() and re-submit the records. > > > >>>>>>>> 2. Implementing this inside the Producer API is complex and > > > >> difficult > > > >>>>> to > > > >>>>>>>> holistically define in a way that we won't regret or need to > > > change > > > >>>>>>> later. > > > >>>>>>>> I think some of the disagreement in this thread originates > from > > > >> this, > > > >>>>>>> and I > > > >>>>>>>> don't find the proposed API satisfactory. > > > >>>>>>>> 3. The performance improvement of including this change in the > > > >> lower > > > >>>>>>> level > > > >>>>>>>> needs to be quantified in order to be a justification, and I > > don't > > > >>> see > > > >>>>>>> any > > > >>>>>>>> analysis about this. > > > >>>>>>>> > > > >>>>>>>> I imagine that the alternative implementation I suggested in > (1) > > > >>> would > > > >>>>>>> also > > > >>>>>>>> enable more expressive error handlers in Streams, if such a > > thing > > > >> was > > > >>>>>>>> desired. Keeping the record around until after the transaction > > is > > > >>>>>>> committed > > > >>>>>>>> would enable a DLQ or passing the erroneous record to the > error > > > >>>>> handler. > > > >>>>>>>> > > > >>>>>>>> I think that the current pattern of the application being > > > >> responsible > > > >>>>>>> for > > > >>>>>>>> providing good data to the producer is very reasonable; Having > > the > > > >>>>>>> producer > > > >>>>>>>> responsible for implementing the application's error handling > of > > > >> bad > > > >>>>>>> data > > > >>>>>>>> is not something I can support. > > > >>>>>>>> > > > >>>>>>>> Thanks, > > > >>>>>>>> Greg > > > >>>>>>>> > > > >>>>>>>> [1] https://www.sommarskog.se/error_handling/Part1.html > > > >>>>>>>> [2] > > > >>>>> > https://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf > > > >>>>>>>> > > > >>>>>>>> On Fri, Jul 12, 2024 at 8:52 AM Justine Olshan > > > >>>>>>>> <jols...@confluent.io.invalid> > > > >>>>>>>> wrote: > > > >>>>>>>> > > > >>>>>>>>> Can we update the KIP to clearly document these decisions? > > > >>>>>>>>> > > > >>>>>>>>> Thanks, > > > >>>>>>>>> > > > >>>>>>>>> Justine > > > >>>>>>>>> > > > >>>>>>>>> On Tue, Jul 9, 2024 at 9:25 AM Andrew Schofield < > > > >>>>>>>> andrew_schofi...@live.com > > > >>>>>>>>>> > > > >>>>>>>>> wrote: > > > >>>>>>>>> > > > >>>>>>>>>> Hi Chris, > > > >>>>>>>>>> As it stands, the error handling for transactions in > > > >> KafkaProducer > > > >>>>>>> is > > > >>>>>>>> not > > > >>>>>>>>>> ideal. There’s no reason why a failed operation should fail > a > > > >>>>>>>> transaction > > > >>>>>>>>>> provided that the application can tell that the operation > was > > > not > > > >>>>>>>>> included > > > >>>>>>>>>> in the transaction and then make its own decision whether to > > > >>>>>>> continue > > > >>>>>>>> or > > > >>>>>>>>>> back out. So, I think I disagree with the original premise > of > > a > > > >>>>>>>>> client-side > > > >>>>>>>>>> error state for a transaction, but we are where we are. > > > >>>>>>>>>> > > > >>>>>>>>>> When I voted, I did not expect the KIP to handle ALL errors > > > which > > > >>>>>>> could > > > >>>>>>>>>> conceivably be handled. I did expect it to handle > client-side > > > >> send > > > >>>>>>>> errors > > > >>>>>>>>>> that would cause a record to be rejected from a batch before > > > >>>>> sending > > > >>>>>>>> to a > > > >>>>>>>>>> broker. I think that it does make the KafkaProducer > interface > > > >> very > > > >>>>>>>>> slightly > > > >>>>>>>>>> more complicated, but the new option is a clear improvement > > and > > > I > > > >>>>>>>>>> don’t see anyone getting into a mess using it. > > > >>>>>>>>>> > > > >>>>>>>>>> I think broker-side errors are more tricky and I don’t think > > an > > > >>>>>>>> overload > > > >>>>>>>>>> on the send() method is going to do the job. I don’t see > that > > as > > > >> a > > > >>>>>>>>> problem > > > >>>>>>>>>> with the KIP, just that the underlying RPCs and behaviour is > > not > > > >>>>>>> very > > > >>>>>>>>>> amenable to record-specific error handling. The Produce RPC > > is a > > > >>>>>>>>>> complicated beast which can include a set of records for > > mutiple > > > >>>>>>>>>> topic-partitions. Although ProduceResponse v10 does include > > > >> record > > > >>>>>>>>>> errors, I don’t believe this is surfaced in the client. > Let’s > > > >>>>>>> imagine > > > >>>>>>>>>> something > > > >>>>>>>>>> like broker-side record validation which barfs on one > record. > > > >>>>>>> Failing > > > >>>>>>>> an > > > >>>>>>>>>> entire batch is easier, but less useful if the problem is > > > related > > > >>>>> to > > > >>>>>>>> one > > > >>>>>>>>>> record. > > > >>>>>>>>>> > > > >>>>>>>>>> In summary, I’m happy that my vote stands, and I am happy > with > > > >> the > > > >>>>>>> KIP > > > >>>>>>>>>> only supporting client-side record errors. > > > >>>>>>>>>> > > > >>>>>>>>>> Thanks, > > > >>>>>>>>>> Andrew > > > >>>>>>>>>> > > > >>>>>>>>>>> On 8 Jul 2024, at 16:37, Chris Egerton > > <chr...@aiven.io.INVALID > > > >>>>>> > > > >>>>>>>>> wrote: > > > >>>>>>>>>>> > > > >>>>>>>>>>> Hi Alieh, > > > >>>>>>>>>>> > > > >>>>>>>>>>> Can you clarify why broker-side errors shouldn't be > covered? > > > The > > > >>>>>>> only > > > >>>>>>>>>> real > > > >>>>>>>>>>> rationale I can come up with is that it's easier to > > implement. > > > >>>>>>>>>>> > > > >>>>>>>>>>> "Things were better for Kafka Streams before KAFKA-9279 was > > > >>>>> fixed" > > > >>>>>>>>> isn't > > > >>>>>>>>>>> very convincing, because Kafka Streams is not the only user > > of > > > >>>>> the > > > >>>>>>>> Java > > > >>>>>>>>>>> producer client. And for others, especially new users, I > > doubt > > > >>>>>>> that > > > >>>>>>>>> this > > > >>>>>>>>>>> new API we're proposing would make sense without having to > > > >>>>>>> consult a > > > >>>>>>>>> lot > > > >>>>>>>>>> of > > > >>>>>>>>>>> historical context. > > > >>>>>>>>>>> > > > >>>>>>>>>>> I also don't think that most users will know or even care > > about > > > >>>>>>> the > > > >>>>>>>>>>> distinction between errors that cause a record to fail > before > > > >>>>> it's > > > >>>>>>>>> added > > > >>>>>>>>>> to > > > >>>>>>>>>>> a batch vs. after. If you were writing a producer > application > > > of > > > >>>>>>> your > > > >>>>>>>>>> own, > > > >>>>>>>>>>> and you wanted to handle RecordTooLargeException instances > by > > > >>>>>>>> dropping > > > >>>>>>>>> a > > > >>>>>>>>>>> record without aborting a transaction, would you care about > > > >>>>>>> whether > > > >>>>>>>> it > > > >>>>>>>>>> was > > > >>>>>>>>>>> your client or your broker that balked? Would you be happy > if > > > >>>>> you > > > >>>>>>>> wrote > > > >>>>>>>>>>> logic expecting that that problem was solved once and for > > all, > > > >>>>>>> only > > > >>>>>>>> to > > > >>>>>>>>>>> learn that it could still affect you in other > circumstances? > > > Or, > > > >>>>>>>>>>> alternatively, would you be happy if you wanted to solve > that > > > >>>>>>> problem > > > >>>>>>>>> and > > > >>>>>>>>>>> found an API that seemed to do exactly what you wanted, but > > > >>>>> after > > > >>>>>>>>> reading > > > >>>>>>>>>>> the fine print, realized you'd have to do it yourself > > instead? > > > >>>>>>>>>>> > > > >>>>>>>>>>> Ultimately, the more I think about this, the more I believe > > > that > > > >>>>>>>> we're > > > >>>>>>>>>>> adding noise to the API (with the new overloaded variant of > > > >>>>> send) > > > >>>>>>>> for a > > > >>>>>>>>>>> feature that will likely bring confusion and even > frustration > > > to > > > >>>>>>>> anyone > > > >>>>>>>>>>> besides maintainers of Kafka Streams who tries to use it. > > > >>>>>>>>>>> > > > >>>>>>>>>>> If the only concern about covering broker-side errors is > that > > > it > > > >>>>>>>> would > > > >>>>>>>>> be > > > >>>>>>>>>>> more difficult to implement, I believe we should strongly > > > >>>>>>> reconsider > > > >>>>>>>>> that > > > >>>>>>>>>>> alternative. That said, if there is a straightforward way > to > > > >>>>>>> explain > > > >>>>>>>>> this > > > >>>>>>>>>>> feature to new users that won't mislead them or require > them > > to > > > >>>>> do > > > >>>>>>>>>> research > > > >>>>>>>>>>> on producer internals, then I can still live with it. > > > >>>>>>>>>>> > > > >>>>>>>>>>> Regarding a list of recoverable vs. irrecoverable errors, > > this > > > >>>>> is > > > >>>>>>>>>> actually > > > >>>>>>>>>>> the subject of another recently-introduced KIP: > > > >>>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>>> > > > >>> > > > >> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1050%3A+Consistent+error+handling+for+Transactions > > > >>>>>>>>>>> > > > >>>>>>>>>>> Finally, I'd also like to ask the people who have already > > voted > > > >>>>>>>>> (Andrew, > > > >>>>>>>>>>> Matthias) if, at the time they voted, they believed that > the > > > API > > > >>>>>>>> would > > > >>>>>>>>>>> handle all errors, or only the subset of errors that would > > > >>>>> cause a > > > >>>>>>>>> record > > > >>>>>>>>>>> to be rejected from a batch before it can be sent to a > > broker. > > > >>>>>>>>>>> > > > >>>>>>>>>>> Best, > > > >>>>>>>>>>> > > > >>>>>>>>>>> Chris > > > >>>>>>>>>>> > > > >>>>>>>>>>> On Thu, Jul 4, 2024 at 12:43 PM Alieh Saeedi > > > >>>>>>>>>> <asae...@confluent.io.invalid> > > > >>>>>>>>>>> wrote: > > > >>>>>>>>>>> > > > >>>>>>>>>>>> Salut from the KIP’s author > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Clarifying two points: > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> 1) broker side errors: > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> As far as I remember we are not going to cover the errors > > > >>>>>>>> originating > > > >>>>>>>>>> from > > > >>>>>>>>>>>> the broker! > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> A historical fact: One of the debate points in KIP-1038 > was > > > >>>>> that > > > >>>>>>> by > > > >>>>>>>>>>>> defining a producer custom handler, the user may assume > that > > > >>>>>>>>> broker-side > > > >>>>>>>>>>>> errors must be covered as well. They may define a handler > > for > > > >>>>>>>> handling > > > >>>>>>>>>>>> `RecordTooLargeException` and still see such errors not > > being > > > >>>>>>>> handled > > > >>>>>>>>> as > > > >>>>>>>>>>>> they wish. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> 2) Regarding irrecoverable/recoverable errors: > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Before the fix of `KAFKA-9279`, errors such as > > > >>>>>>>>>> `RecordTooLargeException` > > > >>>>>>>>>>>> or errors related to missing meta data (both originating > > from > > > >>>>>>>> Producer > > > >>>>>>>>>>>> `send()`) were considered as recoverable but after that > they > > > >>>>>>> turned > > > >>>>>>>>> into > > > >>>>>>>>>>>> being irrecoverable without changing any Javadocs or > having > > > any > > > >>>>>>> KIP. > > > >>>>>>>>>> All > > > >>>>>>>>>>>> the effort made in this KIP and the former one have been > > > >>>>> towards > > > >>>>>>>>>> returning > > > >>>>>>>>>>>> to the former state. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> I am sure that it is clear for you that which sort of > errors > > > we > > > >>>>>>> are > > > >>>>>>>>>> going > > > >>>>>>>>>>>> to cover: A single record may happen to NOT get added to > the > > > >>>>>>> batch > > > >>>>>>>> due > > > >>>>>>>>>> to > > > >>>>>>>>>>>> the issues with the record or its corresponding topic. The > > > >>>>> point > > > >>>>>>> was > > > >>>>>>>>>> that > > > >>>>>>>>>>>> if the record is not added to the batch let ’s don’t fail > > the > > > >>>>>>> whole > > > >>>>>>>>>> batch > > > >>>>>>>>>>>> because of that non-existing record. We never intended to > do > > > >>>>> sth > > > >>>>>>> in > > > >>>>>>>>>> broker > > > >>>>>>>>>>>> side or ignore more important errors. But I agree with > you > > > >>>>>>> Chris. > > > >>>>>>>> If > > > >>>>>>>>> we > > > >>>>>>>>>>>> are adding a new API, we must have good documentation for > > > that. > > > >>>>>>> The > > > >>>>>>>>>>>> sentence `all irrecoverable transactional errors will > still > > be > > > >>>>>>>> fatal` > > > >>>>>>>>> as > > > >>>>>>>>>>>> you suggested is good. What do you think? I am totally > > against > > > >>>>>>>>>> enumerating > > > >>>>>>>>>>>> errors in Javadocs since these sort of errors can be > > changing > > > >>>>>>> during > > > >>>>>>>>>>>> time. More > > > >>>>>>>>>>>> over, have you ever seen any list of recoverable or > > > >>>>> irrecoverable > > > >>>>>>>>> errors > > > >>>>>>>>>>>> somewhere so far? > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Bests, > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Alieh > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> On Wed, Jul 3, 2024 at 6:07 PM Chris Egerton > > > >>>>>>>> <chr...@aiven.io.invalid > > > >>>>>>>>>> > > > >>>>>>>>>>>> wrote: > > > >>>>>>>>>>>> > > > >>>>>>>>>>>>> Hi Justine, > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> I agree that enumerating a list of errors that should be > > > >>>>>>> covered by > > > >>>>>>>>> the > > > >>>>>>>>>>>> KIP > > > >>>>>>>>>>>>> is difficult; I was thinking it might be easier if we > list > > > the > > > >>>>>>>> errors > > > >>>>>>>>>>>> that > > > >>>>>>>>>>>>> should _not_ be covered by the KIP, and only if we can't > > > >>>>> define > > > >>>>>>> a > > > >>>>>>>>>>>>> reasonable heuristic that would cover them without having > > to > > > >>>>>>>>> explicitly > > > >>>>>>>>>>>>> list them. Could it be enough to say "all irrecoverable > > > >>>>>>>> transactional > > > >>>>>>>>>>>>> errors will still be fatal", or even just "all > > transactional > > > >>>>>>> errors > > > >>>>>>>>> (as > > > >>>>>>>>>>>>> opposed to errors related to this specific record) will > > still > > > >>>>> be > > > >>>>>>>>>> fatal"? > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> Cheers, > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> Chris > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> On Wed, Jul 3, 2024 at 11:56 AM Justine Olshan > > > >>>>>>>>>>>>> <jols...@confluent.io.invalid> > > > >>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>>> Hey Chris, > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> I think what you say makes sense. I agree that defining > > the > > > >>>>>>>> behavior > > > >>>>>>>>>>>>> based > > > >>>>>>>>>>>>>> on code that can possibly change is not a good idea, > and I > > > >>>>> was > > > >>>>>>>>> trying > > > >>>>>>>>>>>> to > > > >>>>>>>>>>>>>> get a clearer definition from the KIP's author :) > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> I think it can always be hard to ensure that only > specific > > > >>>>>>> errors > > > >>>>>>>>> are > > > >>>>>>>>>>>>>> handled unless they are explicitly enumerated in code as > > the > > > >>>>>>> code > > > >>>>>>>>> can > > > >>>>>>>>>>>>>> change and can be changed by folks who are not aware of > > this > > > >>>>>>> KIP > > > >>>>>>>> or > > > >>>>>>>>>>>>>> conversation. > > > >>>>>>>>>>>>>> I personally don't have the bandwidth to do this > > > >>>>>>>>>> definition/enumeration > > > >>>>>>>>>>>>> of > > > >>>>>>>>>>>>>> errors, so hopefully Alieh can expand upon this. > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> Justine > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> On Wed, Jul 3, 2024 at 8:28 AM Chris Egerton > > > >>>>>>>>> <chr...@aiven.io.invalid > > > >>>>>>>>>>> > > > >>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> Hi Alieh, > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> I don't love defining the changes for this KIP in terms > > of > > > a > > > >>>>>>>> catch > > > >>>>>>>>>>>>> clause > > > >>>>>>>>>>>>>>> in the KafkaProducer class, for two reasons. First, the > > set > > > >>>>> of > > > >>>>>>>>> errors > > > >>>>>>>>>>>>>> that > > > >>>>>>>>>>>>>>> are handled by that clause may shift over time as the > > code > > > >>>>>>> base > > > >>>>>>>> is > > > >>>>>>>>>>>>>>> modified, and second, it would be fairly opaque to > users > > > who > > > >>>>>>> want > > > >>>>>>>>> to > > > >>>>>>>>>>>>>>> understand whether an error would be affected by using > > this > > > >>>>>>> API > > > >>>>>>>> or > > > >>>>>>>>>>>> not. > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> It also seems strange that we'd handle some types of > > > >>>>>>>>>>>>>>> RecordTooLargeException (i.e., ones reported > client-side) > > > >>>>> with > > > >>>>>>>> this > > > >>>>>>>>>>>>> API, > > > >>>>>>>>>>>>>>> but not others (i.e., ones reported by a broker). > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> I think this kind of API would be most powerful, most > > > >>>>>>> intuitive > > > >>>>>>>> to > > > >>>>>>>>>>>>> users, > > > >>>>>>>>>>>>>>> and easiest to document if we expanded the scope to all > > > >>>>>>>>>>>>>> record-send-related > > > >>>>>>>>>>>>>>> errors, except anything indicating issues with > > exactly-once > > > >>>>>>>>>>>> semantics. > > > >>>>>>>>>>>>>> That > > > >>>>>>>>>>>>>>> would include records that are too large (when caught > > both > > > >>>>>>>> client- > > > >>>>>>>>>>>> and > > > >>>>>>>>>>>>>>> server-side), records that can't be sent due to > > > >>>>> authorization > > > >>>>>>>>>>>> failures, > > > >>>>>>>>>>>>>>> records sent to nonexistent topics/topic partitions, > and > > > >>>>>>> keyless > > > >>>>>>>>>>>>> records > > > >>>>>>>>>>>>>>> sent to compacted topics. It would not include > > > >>>>>>>>>>>>>>> ProducerFencedException, InvalidProducerEpochException, > > > >>>>>>>>>>>>>>> UnsupportedVersionException, > > > >>>>>>>>>>>>>>> and possibly others. > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> @Justine -- do you think it would be possible to > develop > > > >>>>>>> either a > > > >>>>>>>>>>>>> better > > > >>>>>>>>>>>>>>> definition for the kinds of "excluded" errors that > should > > > >>>>> not > > > >>>>>>> be > > > >>>>>>>>>>>>> covered > > > >>>>>>>>>>>>>> by > > > >>>>>>>>>>>>>>> this API, or, barring that, a comprehensive list of > exact > > > >>>>>>> error > > > >>>>>>>>>>>> types? > > > >>>>>>>>>>>>>> And > > > >>>>>>>>>>>>>>> do you think this would be acceptable in terms of risk > > and > > > >>>>>>>>>>>> complexity? > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> Cheers, > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> Chris > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> On Tue, Jul 2, 2024 at 5:05 PM Alieh Saeedi > > > >>>>>>>>>>>>> <asae...@confluent.io.invalid > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> Hey Justine, > > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> About the consequences: the consequences will be like > > when > > > >>>>> we > > > >>>>>>>> did > > > >>>>>>>>>>>> not > > > >>>>>>>>>>>>>>> have > > > >>>>>>>>>>>>>>>> the fix made in `KAFKA-9279`: silent loss of data! > > > >>>>> Obviously, > > > >>>>>>>> when > > > >>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>> user > > > >>>>>>>>>>>>>>>> intentionally chose to ignore errors, that would not > be > > > >>>>>>> silent > > > >>>>>>>> any > > > >>>>>>>>>>>>>> more. > > > >>>>>>>>>>>>>>>> Right? > > > >>>>>>>>>>>>>>>> Of course, considering all types of `ApiException`s > > would > > > >>>>> be > > > >>>>>>> too > > > >>>>>>>>>>>>> broad. > > > >>>>>>>>>>>>>>> But > > > >>>>>>>>>>>>>>>> are the exceptions caught in `catch(ApiException e)` > of > > > the > > > >>>>>>>>>>>>> `doSend()` > > > >>>>>>>>>>>>>>>> method also too broad? > > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> -Alieh > > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> On Tue, Jul 2, 2024 at 9:45 PM Justine Olshan > > > >>>>>>>>>>>>>>> <jols...@confluent.io.invalid > > > >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> Hey Alieh, > > > >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> If we want to allow any error to be ignored we should > > > >>>>>>> probably > > > >>>>>>>>>>>> run > > > >>>>>>>>>>>>>>>> through > > > >>>>>>>>>>>>>>>>> all the errors to make sure they make sense. > > > >>>>>>>>>>>>>>>>> I just want to feel confident that we aren't just > > making > > > a > > > >>>>>>>>>>>> decision > > > >>>>>>>>>>>>>>>> without > > > >>>>>>>>>>>>>>>>> considering the consequences carefully. > > > >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> Justine > > > >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> On Tue, Jul 2, 2024 at 12:30 PM Alieh Saeedi > > > >>>>>>>>>>>>>>>> <asae...@confluent.io.invalid > > > >>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>> Hey Justine, > > > >>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>> yes we talked about `RecordTooLargeException` as an > > > >>>>>>> example, > > > >>>>>>>>>>>> but > > > >>>>>>>>>>>>>> did > > > >>>>>>>>>>>>>>> we > > > >>>>>>>>>>>>>>>>>> ever limit ourselves to only this specific > exception? > > I > > > >>>>>>> think > > > >>>>>>>>>>>>>> neither > > > >>>>>>>>>>>>>>>> in > > > >>>>>>>>>>>>>>>>>> the KIP nor in the PR. As Chris mentioned, this KIP > > is > > > >>>>>>> going > > > >>>>>>>>>>>> to > > > >>>>>>>>>>>>>> undo > > > >>>>>>>>>>>>>>>>> what > > > >>>>>>>>>>>>>>>>>> we have done in `KAFKA-9279` in case 1) the user is > > in a > > > >>>>>>>>>>>>>> transaction > > > >>>>>>>>>>>>>>>> and > > > >>>>>>>>>>>>>>>>> 2) > > > >>>>>>>>>>>>>>>>>> he decides to ignore the errors in which the record > > was > > > >>>>> not > > > >>>>>>>>>>>> even > > > >>>>>>>>>>>>>>> added > > > >>>>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>>>> the batch. Yes, and we suggested some methods for > > > undoing > > > >>>>>>> or, > > > >>>>>>>>>>>> in > > > >>>>>>>>>>>>>>> fact, > > > >>>>>>>>>>>>>>>>>> moving back the transaction from the error state in > > > >>>>>>> `flush` or > > > >>>>>>>>>>>> in > > > >>>>>>>>>>>>>>>>>> `commitTnx` and we finally came to the idea of not > > even > > > >>>>>>> doing > > > >>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>> changes > > > >>>>>>>>>>>>>>>>>> (better than undoing) in `send`. > > > >>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>> Bests, > > > >>>>>>>>>>>>>>>>>> Alieh > > > >>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>> On Tue, Jul 2, 2024 at 8:03 PM Justine Olshan > > > >>>>>>>>>>>>>>>>> <jols...@confluent.io.invalid > > > >>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>> Hey folks, > > > >>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>> I understand where you are coming from by asking > for > > > >>>>>>> specific > > > >>>>>>>>>>>>> use > > > >>>>>>>>>>>>>>>>> cases. > > > >>>>>>>>>>>>>>>>>> My > > > >>>>>>>>>>>>>>>>>>> understanding based on previous conversations was > > that > > > >>>>>>> there > > > >>>>>>>>>>>>>> were a > > > >>>>>>>>>>>>>>>> few > > > >>>>>>>>>>>>>>>>>>> different errors that have been seen. > > > >>>>>>>>>>>>>>>>>>> One example I heard some information about was when > > the > > > >>>>>>>>>>>> record > > > >>>>>>>>>>>>>> was > > > >>>>>>>>>>>>>>>> too > > > >>>>>>>>>>>>>>>>>>> large and it fails the batch. Besides that, I'm not > > > >>>>> really > > > >>>>>>>>>>>> sure > > > >>>>>>>>>>>>>> if > > > >>>>>>>>>>>>>>>>> there > > > >>>>>>>>>>>>>>>>>>> are cases in mind, though it is fair to ask on > those > > > and > > > >>>>>>>>>>>> bring > > > >>>>>>>>>>>>>> them > > > >>>>>>>>>>>>>>>> up. > > > >>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>> Does a record qualify as a poison pill if it > > targets a > > > >>>>>>>>>>>> topic > > > >>>>>>>>>>>>>> that > > > >>>>>>>>>>>>>>>>>>> doesn't exist? Or if it targets a topic that the > > > >>>>> producer > > > >>>>>>>>>>>>>> principal > > > >>>>>>>>>>>>>>>>> lacks > > > >>>>>>>>>>>>>>>>>>> ACLs for? What if it fails broker-side validation > > > (e.g., > > > >>>>>>> has > > > >>>>>>>>>>>> a > > > >>>>>>>>>>>>>> null > > > >>>>>>>>>>>>>>>> key > > > >>>>>>>>>>>>>>>>>> for > > > >>>>>>>>>>>>>>>>>>> a compacted topic)? > > > >>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>> I think there was some parallel work with > addressing > > > the > > > >>>>>>>>>>>>>>>>>>> UnknownTopicOrPartitionError in another way. As for > > the > > > >>>>>>> other > > > >>>>>>>>>>>>>>> checks, > > > >>>>>>>>>>>>>>>>>> acls, > > > >>>>>>>>>>>>>>>>>>> validation etc. I am not aware of that being in > > Alieh's > > > >>>>>>>>>>>> scope, > > > >>>>>>>>>>>>>> but > > > >>>>>>>>>>>>>>> we > > > >>>>>>>>>>>>>>>>>>> should be clear about exactly what we are doing. > > > >>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>> All errors that fall into ApiException seems too > > broad > > > >>>>> to > > > >>>>>>> me. > > > >>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>> Justine > > > >>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>> On Tue, Jul 2, 2024 at 10:51 AM Alieh Saeedi > > > >>>>>>>>>>>>>>>>>> <asae...@confluent.io.invalid > > > >>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>> Hey Chris, > > > >>>>>>>>>>>>>>>>>>>> thanks for sharing your concerns. > > > >>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>> 1) About the language of KIP (or maybe later in > > > >>>>>>> Javadocs): > > > >>>>>>>>>>>> Is > > > >>>>>>>>>>>>>>> that > > > >>>>>>>>>>>>>>>>>>> alright > > > >>>>>>>>>>>>>>>>>>>> if I write all errors that fall into the > > > `ApiException` > > > >>>>>>>>>>>>>> category > > > >>>>>>>>>>>>>>>>> thrown > > > >>>>>>>>>>>>>>>>>>>> (actually returned) by Producer? > > > >>>>>>>>>>>>>>>>>>>> 2) About future expansion: do you have any better > > > >>>>>>>>>>>> suggestions > > > >>>>>>>>>>>>>> for > > > >>>>>>>>>>>>>>>>> enum > > > >>>>>>>>>>>>>>>>>>>> names? Do you think `IGNORE_API_EXEPTIONS` or > > > something > > > >>>>>>>>>>>> like > > > >>>>>>>>>>>>>> that > > > >>>>>>>>>>>>>>>> is > > > >>>>>>>>>>>>>>>>> a > > > >>>>>>>>>>>>>>>>>>>> "better/more accurate" one? > > > >>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>> Bests, > > > >>>>>>>>>>>>>>>>>>>> Alieh > > > >>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>> On Tue, Jul 2, 2024 at 7:29 PM Chris Egerton > > > >>>>>>>>>>>>>>>> <chr...@aiven.io.invalid > > > >>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> Hi Alieh and Justine, > > > >>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> I'm concerned that we're settling on a definition > > of > > > >>>>>>>>>>>>> "poison > > > >>>>>>>>>>>>>>>> pill" > > > >>>>>>>>>>>>>>>>>>> that's > > > >>>>>>>>>>>>>>>>>>>>> easiest to tackle right now but may lead to > > > >>>>> shortcomings > > > >>>>>>>>>>>>> down > > > >>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>> road. I > > > >>>>>>>>>>>>>>>>>>>>> understand the relationship between this KIP and > > > >>>>>>>>>>>>> KAFKA-9279, > > > >>>>>>>>>>>>>>> and > > > >>>>>>>>>>>>>>>> I > > > >>>>>>>>>>>>>>>>>> can > > > >>>>>>>>>>>>>>>>>>>>> totally get behind the desire to keep things > small, > > > >>>>>>>>>>>>> focused, > > > >>>>>>>>>>>>>>> and > > > >>>>>>>>>>>>>>>>>> simple > > > >>>>>>>>>>>>>>>>>>>> in > > > >>>>>>>>>>>>>>>>>>>>> the name of avoiding bugs. However, what I don't > > > think > > > >>>>>>> is > > > >>>>>>>>>>>>>> clear > > > >>>>>>>>>>>>>>>> at > > > >>>>>>>>>>>>>>>>>> all > > > >>>>>>>>>>>>>>>>>>> is > > > >>>>>>>>>>>>>>>>>>>>> what the "specific circumstances" are that > Justine > > > >>>>>>>>>>>>>> mentioned. I > > > >>>>>>>>>>>>>>>>> had a > > > >>>>>>>>>>>>>>>>>>>>> drastically different idea of what the intended > > > >>>>>>>>>>>> behavioral > > > >>>>>>>>>>>>>>> change > > > >>>>>>>>>>>>>>>>>> would > > > >>>>>>>>>>>>>>>>>>>> be > > > >>>>>>>>>>>>>>>>>>>>> before looking at the draft PR. > > > >>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> I would like 1) for us to be clearer about the > > > >>>>>>> categories > > > >>>>>>>>>>>>> of > > > >>>>>>>>>>>>>>>> errors > > > >>>>>>>>>>>>>>>>>>> that > > > >>>>>>>>>>>>>>>>>>>> we > > > >>>>>>>>>>>>>>>>>>>>> want to cover with this new API (especially since > > > >>>>> we'll > > > >>>>>>>>>>>>> have > > > >>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>>> find > > > >>>>>>>>>>>>>>>>>> a > > > >>>>>>>>>>>>>>>>>>>>> clear, succinct way to document this for users), > > and > > > >>>>> 2) > > > >>>>>>>>>>>> to > > > >>>>>>>>>>>>>> make > > > >>>>>>>>>>>>>>>>> sure > > > >>>>>>>>>>>>>>>>>>> that > > > >>>>>>>>>>>>>>>>>>>>> if we do try to expand this API in the future, > that > > > we > > > >>>>>>>>>>>>> won't > > > >>>>>>>>>>>>>> be > > > >>>>>>>>>>>>>>>>>> painted > > > >>>>>>>>>>>>>>>>>>>>> into a corner. > > > >>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> For item 1, hopefully we can agree that the > > language > > > >>>>> in > > > >>>>>>>>>>>> the > > > >>>>>>>>>>>>>> KIP > > > >>>>>>>>>>>>>>>>>>>>> for IGNORE_SEND_ERRORS ("The records causing > > > >>>>>>>>>>>> irrecoverable > > > >>>>>>>>>>>>>>> errors > > > >>>>>>>>>>>>>>>>> are > > > >>>>>>>>>>>>>>>>>>>>> excluded from the batch and the transaction is > > > >>>>> committed > > > >>>>>>>>>>>>>>>>>>> successfully.") > > > >>>>>>>>>>>>>>>>>>>> is > > > >>>>>>>>>>>>>>>>>>>>> pretty vague. If we start using the phrase > "poison > > > >>>>> pill > > > >>>>>>>>>>>>>> record" > > > >>>>>>>>>>>>>>>>> that > > > >>>>>>>>>>>>>>>>>>>> could > > > >>>>>>>>>>>>>>>>>>>>> help, but IMO more detail would still be needed. > We > > > >>>>> know > > > >>>>>>>>>>>>> that > > > >>>>>>>>>>>>>>> we > > > >>>>>>>>>>>>>>>>> want > > > >>>>>>>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>>>>>>> include records that are so large that they can > be > > > >>>>>>>>>>>>>> immediately > > > >>>>>>>>>>>>>>>>>> rejected > > > >>>>>>>>>>>>>>>>>>>> by > > > >>>>>>>>>>>>>>>>>>>>> the producer. But there are other cases that > users > > > >>>>> might > > > >>>>>>>>>>>>>> expect > > > >>>>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>>> be > > > >>>>>>>>>>>>>>>>>>>>> handled. Does a record qualify as a poison pill > if > > it > > > >>>>>>>>>>>>>> targets a > > > >>>>>>>>>>>>>>>>> topic > > > >>>>>>>>>>>>>>>>>>>> that > > > >>>>>>>>>>>>>>>>>>>>> doesn't exist? Or if it targets a topic that the > > > >>>>>>> producer > > > >>>>>>>>>>>>>>>> principal > > > >>>>>>>>>>>>>>>>>>> lacks > > > >>>>>>>>>>>>>>>>>>>>> ACLs for? What if it fails broker-side validation > > > >>>>> (e.g., > > > >>>>>>>>>>>>> has > > > >>>>>>>>>>>>>> a > > > >>>>>>>>>>>>>>>> null > > > >>>>>>>>>>>>>>>>>> key > > > >>>>>>>>>>>>>>>>>>>> for > > > >>>>>>>>>>>>>>>>>>>>> a compacted topic)? > > > >>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> For item 2, this really depends on how narrow the > > > >>>>> scope > > > >>>>>>>>>>>> of > > > >>>>>>>>>>>>>> what > > > >>>>>>>>>>>>>>>>> we're > > > >>>>>>>>>>>>>>>>>>>> doing > > > >>>>>>>>>>>>>>>>>>>>> right now is. If we only handle a subset of the > > > >>>>> examples > > > >>>>>>>>>>>> I > > > >>>>>>>>>>>>>> laid > > > >>>>>>>>>>>>>>>> out > > > >>>>>>>>>>>>>>>>>>> above > > > >>>>>>>>>>>>>>>>>>>>> that could possibly be considered poison pills > with > > > >>>>> this > > > >>>>>>>>>>>>> KIP, > > > >>>>>>>>>>>>>>> do > > > >>>>>>>>>>>>>>>> we > > > >>>>>>>>>>>>>>>>>>> want > > > >>>>>>>>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>>>>>>> lock ourselves in to never addressing more in the > > > >>>>>>> future, > > > >>>>>>>>>>>>> or > > > >>>>>>>>>>>>>>> can > > > >>>>>>>>>>>>>>>> we > > > >>>>>>>>>>>>>>>>>>>> choose > > > >>>>>>>>>>>>>>>>>>>>> an API (probably just enum names would be the > only > > > >>>>>>>>>>>>> important > > > >>>>>>>>>>>>>>>>> decision > > > >>>>>>>>>>>>>>>>>>>> here) > > > >>>>>>>>>>>>>>>>>>>>> that leaves room for more later? > > > >>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> Best, > > > >>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> Chris > > > >>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> On Tue, Jul 2, 2024 at 12:28 PM Justine Olshan > > > >>>>>>>>>>>>>>>>>>>>> <jols...@confluent.io.invalid> > > > >>>>>>>>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>> Chris and Alieh, > > > >>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>> My understanding is that this KIP is really only > > > >>>>> trying > > > >>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>> solve > > > >>>>>>>>>>>>>>>>> an > > > >>>>>>>>>>>>>>>>>>>> issue > > > >>>>>>>>>>>>>>>>>>>>>> of a "poison pill" record that fails send(). > > > >>>>>>>>>>>>>>>>>>>>>> We've talked a lot about having a generic > > framework > > > >>>>> for > > > >>>>>>>>>>>>> all > > > >>>>>>>>>>>>>>>>> errors, > > > >>>>>>>>>>>>>>>>>>>> but I > > > >>>>>>>>>>>>>>>>>>>>>> don't think that is what this KIP is trying to > do. > > > >>>>>>>>>>>>>>> Essentially > > > >>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>>> request > > > >>>>>>>>>>>>>>>>>>>>>> is to undo the change from KAFKA-9279 > > > >>>>>>>>>>>>>>>>>>>>>> < > https://issues.apache.org/jira/browse/KAFKA-9279 > > > > > > >>>>> but > > > >>>>>>>>>>>>>> under > > > >>>>>>>>>>>>>>>>>>> specific > > > >>>>>>>>>>>>>>>>>>>>>> circumstances that are controlled. I really am > > > >>>>>>>>>>>> concerned > > > >>>>>>>>>>>>>>> about > > > >>>>>>>>>>>>>>>>>>> opening > > > >>>>>>>>>>>>>>>>>>>>> new > > > >>>>>>>>>>>>>>>>>>>>>> avenues for bugs with EOS and hesitate to handle > > any > > > >>>>>>>>>>>>> other > > > >>>>>>>>>>>>>>>> types > > > >>>>>>>>>>>>>>>>> of > > > >>>>>>>>>>>>>>>>>>>>> errors. > > > >>>>>>>>>>>>>>>>>>>>>> I think if we all agree on the problem that we > are > > > >>>>>>>>>>>> trying > > > >>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>>> solve, > > > >>>>>>>>>>>>>>>>>>> it > > > >>>>>>>>>>>>>>>>>>>> is > > > >>>>>>>>>>>>>>>>>>>>>> easier to agree on solutions. > > > >>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>> Justine > > > >>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>> On Mon, Jul 1, 2024 at 2:20 AM Alieh Saeedi > > > >>>>>>>>>>>>>>>>>>>> <asae...@confluent.io.invalid > > > >>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>> Hi Matthias, > > > >>>>>>>>>>>>>>>>>>>>>>> Thanks for the valid points you mentioned. I > > > updated > > > >>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>> KIP > > > >>>>>>>>>>>>>>>>> and > > > >>>>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>> PR > > > >>>>>>>>>>>>>>>>>>>>>>> with: > > > >>>>>>>>>>>>>>>>>>>>>>> 1) mentioning that the new overloaded `send` > > throws > > > >>>>>>>>>>>>>>>>>>>>>> `IllegalStateException` > > > >>>>>>>>>>>>>>>>>>>>>>> if the user tries to ignore `send()` errors > > outside > > > >>>>>>>>>>>> of > > > >>>>>>>>>>>>> a > > > >>>>>>>>>>>>>>>>>>> transaction. > > > >>>>>>>>>>>>>>>>>>>>>>> 2) the default implementation in `Producer` > > > >>>>> interface > > > >>>>>>>>>>>>>>> throws > > > >>>>>>>>>>>>>>>> an > > > >>>>>>>>>>>>>>>>>>>>>>> `UnsupportedOperationException` > > > >>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>> Hi Chris, > > > >>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. I tried to clarify the > > > >>>>>>>>>>>> points > > > >>>>>>>>>>>>>> you > > > >>>>>>>>>>>>>>>>>> listed: > > > >>>>>>>>>>>>>>>>>>>>>>> -------> we've narrowed the scope from any > error > > > >>>>> that > > > >>>>>>>>>>>>>> might > > > >>>>>>>>>>>>>>>>> take > > > >>>>>>>>>>>>>>>>>>>> place > > > >>>>>>>>>>>>>>>>>>>>>> with > > > >>>>>>>>>>>>>>>>>>>>>>> producing a record to Kafka, to only the ones > > that > > > >>>>>>>>>>>> are > > > >>>>>>>>>>>>>>> thrown > > > >>>>>>>>>>>>>>>>>>>> directly > > > >>>>>>>>>>>>>>>>>>>>>> from > > > >>>>>>>>>>>>>>>>>>>>>>> Producer::send; > > > >>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>> From the very beginning and even since > > KIP-1038, > > > >> the > > > >>>>>>>>>>>>> main > > > >>>>>>>>>>>>>>>>> purpose > > > >>>>>>>>>>>>>>>>>>> was > > > >>>>>>>>>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>>>>>>>>> have "more flexibility," or, in other words, > > > "giving > > > >>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>> user > > > >>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>>>>> authority" to handle some specific exceptions > > > thrown > > > >>>>>>>>>>>>> from > > > >>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>>> `Producer`. > > > >>>>>>>>>>>>>>>>>>>>>>> Due to the specific cases we had in mind, > > KIP-1038 > > > >>>>>>>>>>>> was > > > >>>>>>>>>>>>>>>>> discarded > > > >>>>>>>>>>>>>>>>>>> and > > > >>>>>>>>>>>>>>>>>>>> we > > > >>>>>>>>>>>>>>>>>>>>>>> decided to not define a > `CustomExceptionHandler` > > > for > > > >>>>>>>>>>>>>>>> `Producer` > > > >>>>>>>>>>>>>>>>>> and > > > >>>>>>>>>>>>>>>>>>>>>> instead > > > >>>>>>>>>>>>>>>>>>>>>>> treat the `send` failures in a different way. > The > > > >>>>>>>>>>>> main > > > >>>>>>>>>>>>>>> issue > > > >>>>>>>>>>>>>>>> is > > > >>>>>>>>>>>>>>>>>>> that > > > >>>>>>>>>>>>>>>>>>>>>> `send` > > > >>>>>>>>>>>>>>>>>>>>>>> makes a transition to error state, which is > > > >>>>> undoable. > > > >>>>>>>>>>>>> In > > > >>>>>>>>>>>>>>>> fact, > > > >>>>>>>>>>>>>>>>>> one > > > >>>>>>>>>>>>>>>>>>>>> single > > > >>>>>>>>>>>>>>>>>>>>>>> poison pill record makes the whole batch fail. > > The > > > >>>>>>>>>>>>> former > > > >>>>>>>>>>>>>>>>>>> suggestions > > > >>>>>>>>>>>>>>>>>>>>>> that > > > >>>>>>>>>>>>>>>>>>>>>>> you agreed with have been all about un-doing > this > > > >>>>>>>>>>>>>>> transition > > > >>>>>>>>>>>>>>>> in > > > >>>>>>>>>>>>>>>>>>>> `flush` > > > >>>>>>>>>>>>>>>>>>>>>> or > > > >>>>>>>>>>>>>>>>>>>>>>> `commit`. The new suggestion is to un-do (or > > > better, > > > >>>>>>>>>>>>> NOT > > > >>>>>>>>>>>>>>> do) > > > >>>>>>>>>>>>>>>> in > > > >>>>>>>>>>>>>>>>>>>> `send` > > > >>>>>>>>>>>>>>>>>>>>>> due > > > >>>>>>>>>>>>>>>>>>>>>>> to the reasons listed in the discussions above. > > > >>>>>>>>>>>>>>>>>>>>>>> Moreover, I would say that having such a large > > > scope > > > >>>>>>>>>>>> as > > > >>>>>>>>>>>>>> you > > > >>>>>>>>>>>>>>>>>>> mentioned > > > >>>>>>>>>>>>>>>>>>>>> is > > > >>>>>>>>>>>>>>>>>>>>>>> impossible. In the best case, we may have > control > > > >>>>>>>>>>>> over > > > >>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>> `Producer`. > > > >>>>>>>>>>>>>>>>>>>>>> What > > > >>>>>>>>>>>>>>>>>>>>>>> shall we do with the broker? The `any error > that > > > >>>>>>>>>>>> might > > > >>>>>>>>>>>>>> take > > > >>>>>>>>>>>>>>>>> place > > > >>>>>>>>>>>>>>>>>>>> with > > > >>>>>>>>>>>>>>>>>>>>>>> producing a record to Kafka` is too much, I > > think. > > > >>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>> -------> is this all we want to handle, and > will > > it > > > >>>>>>>>>>>>>> prevent > > > >>>>>>>>>>>>>>>> us > > > >>>>>>>>>>>>>>>>>> from > > > >>>>>>>>>>>>>>>>>>>>>>> handling more in the future in an intuitive > way? > > > >>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>> I think yes. This is all we want. Other sorts > of > > > >>>>>>>>>>>> errors > > > >>>>>>>>>>>>>>> such > > > >>>>>>>>>>>>>>>> as > > > >>>>>>>>>>>>>>>>>>>> having > > > >>>>>>>>>>>>>>>>>>>>>>> problem with partition addition, producer > fenced > > > >>>>>>>>>>>>>> exception, > > > >>>>>>>>>>>>>>>> etc > > > >>>>>>>>>>>>>>>>>>> seem > > > >>>>>>>>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>>>>>>>> be > > > >>>>>>>>>>>>>>>>>>>>>>> more serious issues. The intention was to > handle > > > >>>>>>>>>>>>> problems > > > >>>>>>>>>>>>>>>>> created > > > >>>>>>>>>>>>>>>>>>> by > > > >>>>>>>>>>>>>>>>>>>>>>> (maybe) a single poison pill record. BTW, I do > > not > > > >>>>>>>>>>>> see > > > >>>>>>>>>>>>>> any > > > >>>>>>>>>>>>>>>>>>> obstacles > > > >>>>>>>>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>>>>>>>>> future changes. > > > >>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>> Bests, > > > >>>>>>>>>>>>>>>>>>>>>>> Alieh > > > >>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>> On Sat, Jun 29, 2024 at 3:03 AM Chris Egerton > > > >>>>>>>>>>>>>>>>>>>> <chr...@aiven.io.invalid > > > >>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>> Ah, sorry--spoke too soon. The PR doesn't show > > > that > > > >>>>>>>>>>>>>>> errors > > > >>>>>>>>>>>>>>>>>> thrown > > > >>>>>>>>>>>>>>>>>>>>> from > > > >>>>>>>>>>>>>>>>>>>>>>>> Producer::send are handled, but instead, > > > >>>>>>>>>>>> ApiException > > > >>>>>>>>>>>>>>>>> instances > > > >>>>>>>>>>>>>>>>>>>> that > > > >>>>>>>>>>>>>>>>>>>>>> are > > > >>>>>>>>>>>>>>>>>>>>>>>> caught inside KafkaProducer::doSend and are > > > handled > > > >>>>>>>>>>>>> by > > > >>>>>>>>>>>>>>>>>> returning > > > >>>>>>>>>>>>>>>>>>> an > > > >>>>>>>>>>>>>>>>>>>>>>>> already-failed future are. I think the same > > > >>>>>>>>>>>> question > > > >>>>>>>>>>>>>>> still > > > >>>>>>>>>>>>>>>>>>> applies > > > >>>>>>>>>>>>>>>>>>>>> (is > > > >>>>>>>>>>>>>>>>>>>>>>> this > > > >>>>>>>>>>>>>>>>>>>>>>>> all we want to handle, and will it prevent us > > from > > > >>>>>>>>>>>>>>> handling > > > >>>>>>>>>>>>>>>>>> more > > > >>>>>>>>>>>>>>>>>>> in > > > >>>>>>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>>>>>> future in an intuitive way), though. > > > >>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>> On Fri, Jun 28, 2024 at 8:57 PM Chris Egerton > < > > > >>>>>>>>>>>>>>>>> chr...@aiven.io > > > >>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>> Hi Alieh, > > > >>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>> This KIP has evolved a lot since I last > looked > > at > > > >>>>>>>>>>>>> it, > > > >>>>>>>>>>>>>>> but > > > >>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>>> changes > > > >>>>>>>>>>>>>>>>>>>>>>>> seem > > > >>>>>>>>>>>>>>>>>>>>>>>>> well thought-out both in semantics and API. > One > > > >>>>>>>>>>>>>>>> clarifying > > > >>>>>>>>>>>>>>>>>>>>> question I > > > >>>>>>>>>>>>>>>>>>>>>>>> have > > > >>>>>>>>>>>>>>>>>>>>>>>>> is that it looks based on the draft PR that > > we've > > > >>>>>>>>>>>>>>>> narrowed > > > >>>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>>> scope > > > >>>>>>>>>>>>>>>>>>>>>>> from > > > >>>>>>>>>>>>>>>>>>>>>>>>> any error that might take place with > producing > > a > > > >>>>>>>>>>>>>> record > > > >>>>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>>>>> Kafka, > > > >>>>>>>>>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>>>>>>>>> only > > > >>>>>>>>>>>>>>>>>>>>>>>>> the ones that are thrown directly from > > > >>>>>>>>>>>>>> Producer::send; > > > >>>>>>>>>>>>>>> is > > > >>>>>>>>>>>>>>>>>> that > > > >>>>>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>>>>>> intended > > > >>>>>>>>>>>>>>>>>>>>>>>>> behavior here? And if so, do you have > thoughts > > on > > > >>>>>>>>>>>>> how > > > >>>>>>>>>>>>>>> we > > > >>>>>>>>>>>>>>>>>> might > > > >>>>>>>>>>>>>>>>>>>>>> design a > > > >>>>>>>>>>>>>>>>>>>>>>>>> follow-up KIP that would catch all errors > > > >>>>>>>>>>>>> (including > > > >>>>>>>>>>>>>>> ones > > > >>>>>>>>>>>>>>>>>>>> reported > > > >>>>>>>>>>>>>>>>>>>>>>>>> asynchronously instead of synchronously)? I'd > > > >>>>>>>>>>>> like > > > >>>>>>>>>>>>> it > > > >>>>>>>>>>>>>>> if > > > >>>>>>>>>>>>>>>> we > > > >>>>>>>>>>>>>>>>>>> could > > > >>>>>>>>>>>>>>>>>>>>>> leave > > > >>>>>>>>>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>>>>>>> door open for that without painting ourselves > > > >>>>>>>>>>>> into > > > >>>>>>>>>>>>>> too > > > >>>>>>>>>>>>>>>> much > > > >>>>>>>>>>>>>>>>>> of > > > >>>>>>>>>>>>>>>>>>> a > > > >>>>>>>>>>>>>>>>>>>>>> corner > > > >>>>>>>>>>>>>>>>>>>>>>>>> with the API design for this KIP. > > > >>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>> Cheers, > > > >>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>> Chris > > > >>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Jun 28, 2024 at 6:31 PM Matthias J. > > Sax < > > > >>>>>>>>>>>>>>>>>>>> mj...@apache.org> > > > >>>>>>>>>>>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks Alieh, > > > >>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> it seems this KIP can just pick between a > > couple > > > >>>>>>>>>>>>> of > > > >>>>>>>>>>>>>>>>>> tradeoffs. > > > >>>>>>>>>>>>>>>>>>>>>> Adding > > > >>>>>>>>>>>>>>>>>>>>>>> an > > > >>>>>>>>>>>>>>>>>>>>>>>>>> overloaded `send()` as the KIP propose makes > > > >>>>>>>>>>>> sense > > > >>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>> me > > > >>>>>>>>>>>>>>>>> and > > > >>>>>>>>>>>>>>>>>>>> seems > > > >>>>>>>>>>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>>>>>>>>>>>> provides the cleanest solution compare to > > there > > > >>>>>>>>>>>>>>> options > > > >>>>>>>>>>>>>>>> we > > > >>>>>>>>>>>>>>>>>>>>>> discussed. > > > >>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Given the explicit name of the passed-in > > option > > > >>>>>>>>>>>>> that > > > >>>>>>>>>>>>>>>>>>> highlights > > > >>>>>>>>>>>>>>>>>>>>> that > > > >>>>>>>>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>>>>>>>> option is for TX only make is pretty clear > and > > > >>>>>>>>>>>>>> avoids > > > >>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>> issue > > > >>>>>>>>>>>>>>>>>>>> of > > > >>>>>>>>>>>>>>>>>>>>>>>>>> `flush()` ambiguity. > > > >>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Nit: We should make clear on the KIP though, > > > >>>>>>>>>>>> that > > > >>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>> new > > > >>>>>>>>>>>>>>>>>>>> `send()` > > > >>>>>>>>>>>>>>>>>>>>>>>>>> overload would throw an > > `IllegalStateException` > > > >>>>>>>>>>>> if > > > >>>>>>>>>>>>>> TX > > > >>>>>>>>>>>>>>>> are > > > >>>>>>>>>>>>>>>>>> not > > > >>>>>>>>>>>>>>>>>>>> used > > > >>>>>>>>>>>>>>>>>>>>>>>>>> (similar to other TX methods like initTx(), > > etc) > > > >>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> About the `Producer` interface, I am not > sure > > > >>>>>>>>>>>> how > > > >>>>>>>>>>>>>> this > > > >>>>>>>>>>>>>>>> was > > > >>>>>>>>>>>>>>>>>>> done > > > >>>>>>>>>>>>>>>>>>>> in > > > >>>>>>>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>>>>>>>> past (eg, KIP-266 added > > > >>>>>>>>>>>> `Consumer.poll(Duration)` > > > >>>>>>>>>>>>>> w/o > > > >>>>>>>>>>>>>>> a > > > >>>>>>>>>>>>>>>>>>> default > > > >>>>>>>>>>>>>>>>>>>>>>>>>> implementation), if we need a default > > > >>>>>>>>>>>>> implementation > > > >>>>>>>>>>>>>>> for > > > >>>>>>>>>>>>>>>>>>>> backward > > > >>>>>>>>>>>>>>>>>>>>>>>>>> compatibility or not? If we do want to add > > one, > > > >>>>>>>>>>>> I > > > >>>>>>>>>>>>>>> think > > > >>>>>>>>>>>>>>>> it > > > >>>>>>>>>>>>>>>>>>> would > > > >>>>>>>>>>>>>>>>>>>>> be > > > >>>>>>>>>>>>>>>>>>>>>>>>>> appropriate to throw an > > > >>>>>>>>>>>>>>> `UnsupportedOperationException` > > > >>>>>>>>>>>>>>>> by > > > >>>>>>>>>>>>>>>>>>>>> default, > > > >>>>>>>>>>>>>>>>>>>>>>>>>> instead of just keeping the default impl > > empty? > > > >>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> My points are rather minor, and should not > > block > > > >>>>>>>>>>>>>> this > > > >>>>>>>>>>>>>>>> KIP > > > >>>>>>>>>>>>>>>>>>>> though. > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Overall LGTM. > > > >>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias > > > >>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> On 6/27/24 1:28 PM, Alieh Saeedi wrote: > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Justine, > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion. > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Making applications to validate every > single > > > >>>>>>>>>>>>>> record > > > >>>>>>>>>>>>>>> is > > > >>>>>>>>>>>>>>>>> not > > > >>>>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>>>> best > > > >>>>>>>>>>>>>>>>>>>>>>>> way, > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> from an efficiency point of view. > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Moreover, between changing the behavior of > > the > > > >>>>>>>>>>>>>>>> Producer > > > >>>>>>>>>>>>>>>>> in > > > >>>>>>>>>>>>>>>>>>>>> `send` > > > >>>>>>>>>>>>>>>>>>>>>>> and > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> `commitTnx`, the former seems more > reasonable > > > >>>>>>>>>>>>> and > > > >>>>>>>>>>>>>>>> clean. > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Bests, > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Alieh > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Jun 27, 2024 at 8:14 PM Justine > > Olshan > > > >>>>>>>>>>>>>>>>>>>>>>>>>> <jols...@confluent.io.invalid> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Alieh, > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> I see there are two options now. So folks > > > >>>>>>>>>>>> will > > > >>>>>>>>>>>>> be > > > >>>>>>>>>>>>>>>>>>> discussing > > > >>>>>>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>>>>>>>> approaches > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> and deciding the best way forward before > we > > > >>>>>>>>>>>>> vote? > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> I do think there could be a problem with > the > > > >>>>>>>>>>>>>>> approach > > > >>>>>>>>>>>>>>>>> on > > > >>>>>>>>>>>>>>>>>>>> commit > > > >>>>>>>>>>>>>>>>>>>>>> if > > > >>>>>>>>>>>>>>>>>>>>>>> we > > > >>>>>>>>>>>>>>>>>>>>>>>>>> get > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> stuck on an earlier error and have more > > > >>>>>>>>>>>> records > > > >>>>>>>>>>>>>>>>>>> (potentially > > > >>>>>>>>>>>>>>>>>>>> on > > > >>>>>>>>>>>>>>>>>>>>>> new > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions) to commit as the current PR is > > > >>>>>>>>>>>>>>>> implemented. > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> I guess this takes us back to the question > > of > > > >>>>>>>>>>>>>>> whether > > > >>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>> error > > > >>>>>>>>>>>>>>>>>>>>>>>> should > > > >>>>>>>>>>>>>>>>>>>>>>>>>> be > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> cleared on send. > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> (And I guess at the back of my mind, I'm > > > >>>>>>>>>>>>>> wondering > > > >>>>>>>>>>>>>>> if > > > >>>>>>>>>>>>>>>>>> there > > > >>>>>>>>>>>>>>>>>>>> is > > > >>>>>>>>>>>>>>>>>>>>> a > > > >>>>>>>>>>>>>>>>>>>>>>> way > > > >>>>>>>>>>>>>>>>>>>>>>>>>> we can > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> validate the "posion pill" records > > > >>>>>>>>>>>> application > > > >>>>>>>>>>>>>> side > > > >>>>>>>>>>>>>>>>>> before > > > >>>>>>>>>>>>>>>>>>> we > > > >>>>>>>>>>>>>>>>>>>>>> even > > > >>>>>>>>>>>>>>>>>>>>>>>> try > > > >>>>>>>>>>>>>>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> send them) > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Justine > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jun 26, 2024 at 4:38 PM Alieh > Saeedi > > > >>>>>>>>>>>>>>>>>>>>>>>>>> <asae...@confluent.io.invalid > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Justine, > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I did not update the KIP with > > > >>>>>>>>>>>> `TxnSendOption` > > > >>>>>>>>>>>>>>> since > > > >>>>>>>>>>>>>>>> I > > > >>>>>>>>>>>>>>>>>>>> thought > > > >>>>>>>>>>>>>>>>>>>>>> it'd > > > >>>>>>>>>>>>>>>>>>>>>>>> be > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> better discussed here beforehand. > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> right now, there are 2 PRs: > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> - the PR that implements the current > > version > > > >>>>>>>>>>>>> of > > > >>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>> KIP: > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > https://github.com/apache/kafka/pull/16332 > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> - the POC PR that clarifies the > > > >>>>>>>>>>>>> `TxnSendOption`: > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > https://github.com/apache/kafka/pull/16465 > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bests, > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Alieh > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Jun 27, 2024 at 12:42 AM Justine > > > >>>>>>>>>>>>> Olshan > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> <jols...@confluent.io.invalid> wrote: > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Alieh, > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think I am a little confused. Are the > 3 > > > >>>>>>>>>>>>>> points > > > >>>>>>>>>>>>>>>>> above > > > >>>>>>>>>>>>>>>>>>>>>> addressed > > > >>>>>>>>>>>>>>>>>>>>>>> by > > > >>>>>>>>>>>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> KIP > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> or did something change? The PR seems to > > > >>>>>>>>>>>> not > > > >>>>>>>>>>>>>>>> include > > > >>>>>>>>>>>>>>>>>> this > > > >>>>>>>>>>>>>>>>>>>>>> change > > > >>>>>>>>>>>>>>>>>>>>>>>> and > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> still > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> has the CommitOption as well. > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Justine > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jun 26, 2024 at 2:15 PM Alieh > > > >>>>>>>>>>>> Saeedi > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> <asae...@confluent.io.invalid > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all, > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking at the PR < > > > >>>>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/16332 > > > >>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> corresponding to the KIP, there are > some > > > >>>>>>>>>>>>>> points > > > >>>>>>>>>>>>>>>>> worthy > > > >>>>>>>>>>>>>>>>>>> of > > > >>>>>>>>>>>>>>>>>>>>>>> mention: > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) clearing the error ends up > dirty/messy > > > >>>>>>>>>>>>> code > > > >>>>>>>>>>>>>>> in > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> `TransactionManager`. > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) By clearing the error, we are > actually > > > >>>>>>>>>>>>>> doing > > > >>>>>>>>>>>>>>> an > > > >>>>>>>>>>>>>>>>>>> illegal > > > >>>>>>>>>>>>>>>>>>>>>>>>>> transition > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> from > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `ABORTABLE_ERROR` to `IN_TRANSACTION` > > > >>>>>>>>>>>> which > > > >>>>>>>>>>>>> is > > > >>>>>>>>>>>>>>>>>>>> conceptually > > > >>>>>>>>>>>>>>>>>>>>>> not > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> acceptable. > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This can be the root cause of some > > issues, > > > >>>>>>>>>>>>>> with > > > >>>>>>>>>>>>>>>>>> perhaps > > > >>>>>>>>>>>>>>>>>>>>>> further > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> future > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> changes by others. > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) If the poison pill record `r1` > causes > > a > > > >>>>>>>>>>>>>>>>> transition > > > >>>>>>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>>>>> error > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> state > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and then the next record `r2` requires > > > >>>>>>>>>>>>> adding > > > >>>>>>>>>>>>>> a > > > >>>>>>>>>>>>>>>>>>> partition > > > >>>>>>>>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> transaction, the action fails due to > > being > > > >>>>>>>>>>>>> in > > > >>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>> error > > > >>>>>>>>>>>>>>>>>>>>> state. > > > >>>>>>>>>>>>>>>>>>>>>>> In > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> this > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> case, clearing errors during > > > >>>>>>>>>>>>>>>>>>> `commitTnx(CLEAR_SEND_ERROR)` > > > >>>>>>>>>>>>>>>>>>>>> is > > > >>>>>>>>>>>>>>>>>>>>>>> too > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> late. > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> However, this case can NOT be the main > > > >>>>>>>>>>>>> concern > > > >>>>>>>>>>>>>>> as > > > >>>>>>>>>>>>>>>>> soon > > > >>>>>>>>>>>>>>>>>>> as > > > >>>>>>>>>>>>>>>>>>>>>>> KIP-890 > > > >>>>>>>>>>>>>>>>>>>>>>>> is > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fully > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> implemented. > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> My suggestion is to solve the problem > > > >>>>>>>>>>>> where > > > >>>>>>>>>>>>> it > > > >>>>>>>>>>>>>>>>> arises. > > > >>>>>>>>>>>>>>>>>>> If > > > >>>>>>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> transition > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the error state does not happen during > > > >>>>>>>>>>>>>> `send()`, > > > >>>>>>>>>>>>>>>> we > > > >>>>>>>>>>>>>>>>> do > > > >>>>>>>>>>>>>>>>>>> not > > > >>>>>>>>>>>>>>>>>>>>>> need > > > >>>>>>>>>>>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> clear > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the error later. Therefore, instead of > > > >>>>>>>>>>>>>>>>> `CommitOption`, > > > >>>>>>>>>>>>>>>>>>> we > > > >>>>>>>>>>>>>>>>>>>>> can > > > >>>>>>>>>>>>>>>>>>>>>>>> define > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> a > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `TxnSendOption` and prevent the > `send()` > > > >>>>>>>>>>>>>> method > > > >>>>>>>>>>>>>>>> from > > > >>>>>>>>>>>>>>>>>>> going > > > >>>>>>>>>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> error > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> state in case 1) we're in a transaction > > > >>>>>>>>>>>> and > > > >>>>>>>>>>>>> 2) > > > >>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>> user > > > >>>>>>>>>>>>>>>>>>>>> asked > > > >>>>>>>>>>>>>>>>>>>>>>> for > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> IGONRE_SEND_ERRORS. For more clarity, > you > > > >>>>>>>>>>>>> can > > > >>>>>>>>>>>>>>>> take a > > > >>>>>>>>>>>>>>>>>>> look > > > >>>>>>>>>>>>>>>>>>>> at > > > >>>>>>>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>>>>>> POC > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> PR > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> < > > > >>>>>>>>>>>> https://github.com/apache/kafka/pull/16465 > > > >>>>>>>>>>>>>> . > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers, > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Alieh > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>>>> > > > >>>>> > > > >>>> > > > >>> > > > >> > > > > > > > > > >