Re: [DISCUSS] KIP-1038: Add Custom Error Handler to Producer

2024-05-09 Thread Artem Livshits
configs for Connect, but not for the
> > > producer itself)? For this case, we could also do this as a follow up
> > > KIP, but happy to include it in this KIP to provide value to Connect
> > > right away (even if the value might not come right away if we miss the
> > > 3.8 deadline due to expanded KIP scope...) --  For KS, we would for
> sure
> > > plug in our own impl, and lock down the config such that users cannot
> set
> > > their own handler on the internal producer to begin with. Might be good
> > > to elaborate why the producer should have a default? We might actually
> > > want to add this to the KIP right away?
> > >
> > > The key for a default impl would be, to not change the current
> behavior,
> > > and having no default seems to achieve this. For the two cases you
> > > mentioned, it's unclear to me what default value on "upper bound on
> > > retires" for UnkownTopicOrPartitionException we should set? Seems it
> > > would need to be the same as `delivery.timeout.ms`? However, if users
> > > have `delivery.timeout.ms` actually overwritten we would need to set
> > > this config somewhat "dynamic"? Is this feasible? If we hard-code 2
> > > minutes, it might not be backward compatible. I have the impression we
> > > might introduce some undesired coupling? -- For the "record too large"
> > > case, the config seems to be boolean and setting it to `false` by
> > > default seems to provide backward compatibility.
> > >
> > >
> > >
> > > @Artem:
> > >
> > > [AL1] While I see the point, I would think having a different callback
> > > for every exception might not really be elegant? In the end, the
> handler
> > > is a very advanced feature anyway, and if it's implemented in a bad
> > > way, well, it's a user error -- we cannot protect users from
> everything.
> > > To me, a handler like this, is to some extent "business logic" and if a
> > > user gets business logic wrong, it's hard to protect them. -- We would
> > > of course provide best practice guidance in the JavaDocs, and explain
> > > that a handler should have explicit `if` statements for stuff it wants
> > > to handle, and only a single default which returns FAIL.
> > >
> > >
> > > [AL2] Yes, but for KS we would retry at the application layer. Ie, the
> > > TX is not completed, we clean up and set up our task from scratch, to
> > > ensure the pending transaction is completed before we resume. If the TX
> > > was indeed aborted, we would retry from older offset and thus just hit
> > > the same error again and the loop begins again.
> > >
> > >
> > > [AL2 cont.] Similar to AL1, I see such a handler to some extent as
> > > business logic. If a user puts a bad filter condition in their KS app,
> > > and drops messages, there is nothing we can do about it, and this handler
> > > IMHO, has a similar purpose. This is also the line of thinking I apply
> > > to EOS, to address Justin's concern about "should we allow to drop for
> > > EOS", and my answer is "yes", because it's more business logic than
> > > actual error handling IMHO. And by default, we fail... So users opt-in
> > > to add business logic to drop records. It's an application level
> > > decision how to write the code.
> > >
> > >
> > > [AL3] Maybe I misunderstand what you are saying, but to me, checking
> the
> > > size of the record upfront is exactly what the KIP proposes? No?
> > >
> > >
> > >
> > > @Justin:
> > >
> > > > I saw the sample
> > > > code -- is it just an if statement checking for the error before the
> > > > handler is invoked? That seems a bit fragile.
> > >
> > > What do you mean by fragile? Not sure if I see your point.
> > >
> > >
> > >
> > >
> > > -Matthias
> > >
> > > On 5/7/24 5:33 PM, Artem Livshits wrote:
> > > > Hi Alieh,
> > > >
> > > > Thanks for the KIP.  The motivation talks about very specific cases,
> > but
> > > > the interface is generic.
> > > >
> > > > [AL1]
> > > > If the interface evolves in the future I think we could have the
> > > following
> > > > confusion:
> > > >
> > > > 1. A user implemented SWALLOW action for both RecordTooLargeException
> > and
> > > > Unk

Re: [DISCUSS] KIP-1038: Add Custom Error Handler to Producer

2024-05-07 Thread Artem Livshits
Hi Alieh,

Thanks for the KIP.  The motivation talks about very specific cases, but
the interface is generic.

[AL1]
If the interface evolves in the future I think we could have the following
confusion:

1. A user implemented SWALLOW action for both RecordTooLargeException and
UnknownTopicOrPartitionException.  For simplicity they just return SWALLOW
from the function, because it elegantly handles all known cases.
2. The interface has evolved to support a new exception.
3. The user has upgraded their Kafka client.

Now a new kind of error gets dropped on the floor without the user's intention
and it would be super hard to detect and debug.

To avoid the confusion, I think we should use handlers for specific
exceptions.  Then if a new exception is added it won't get silently swallowed
because the user would need to add new functionality to handle it.
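
To make the concern and the suggested mitigation concrete, here is a rough Java sketch (the interface name, method signature, and enum values are placeholders for discussion, not the KIP-1038 API) contrasting a handler that blindly returns SWALLOW with one that checks specific exceptions and keeps a single FAIL default, so a newly supported exception type is never dropped silently:

// Placeholder shapes for discussion; the actual KIP-1038 interface may differ.
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.RecordTooLargeException;
import org.apache.kafka.common.errors.UnknownTopicOrPartitionException;

enum HandlerResponse { SWALLOW, FAIL }

interface CustomExceptionHandler {
    HandlerResponse handle(ProducerRecord<byte[], byte[]> record, Exception exception);
}

// Problematic: swallows every exception, including types added to the handler's scope later.
class SwallowEverythingHandler implements CustomExceptionHandler {
    public HandlerResponse handle(ProducerRecord<byte[], byte[]> record, Exception exception) {
        return HandlerResponse.SWALLOW;
    }
}

// Safer: explicit checks for the exceptions the user means to handle, single default that fails.
class ExplicitHandler implements CustomExceptionHandler {
    public HandlerResponse handle(ProducerRecord<byte[], byte[]> record, Exception exception) {
        if (exception instanceof RecordTooLargeException) {
            return HandlerResponse.SWALLOW;
        }
        if (exception instanceof UnknownTopicOrPartitionException
                && record.topic().startsWith("optional-")) { // hypothetical topic-based rule
            return HandlerResponse.SWALLOW;
        }
        return HandlerResponse.FAIL;
    }
}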

I also have some higher level comments:

[AL2]
> it throws a TimeoutException, and the user can only blindly retry, which
may result in an infinite retry loop

If the TimeoutException happens during transactional processing (exactly
once is the desired semantics), then the client should not retry when it
gets a TimeoutException, because without knowing the reason for the
TimeoutException, the client cannot know whether the message actually got
produced, and retrying the message may result in duplicates.

> The thrown TimeoutException "cuts" the connection to the underlying root
cause of missing metadata

Maybe we should fix the error handling and return the proper underlying
message?  Then the application can properly handle the message based on
preferences.

From the product perspective, it's not clear how safe it is to blindly
ignore UnknownTopicOrPartitionException.  This could lead to situations
where a simple typo causes massive data loss (part of the data would
effectively be produced to a "black hole" and the user may not notice it
for a while).

In which situations would you recommend the user to "black hole" messages
in case of misconfiguration?

[AL3]

> If the custom handler decides on SWALLOW for RecordTooLargeException,

Is my understanding correct that this KIP proposes functionality that would
only be able to SWALLOW a RecordTooLargeException that happens because the
producer cannot produce the record (if the broker rejects the batch, the
error won't get to the handler, because we cannot know which other records
would get ignored)?  In this case, why not just check the locally configured
max record size upfront and not produce the record in the first place?  Maybe
we can expose a validation function from the producer that could validate
the records locally, so we don't need to produce the record in order to
know that it's invalid.
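
For reference, a minimal application-side pre-check along those lines could look like the sketch below.  It assumes the application's own serializers and the producer's max.request.size config; the producer's real size accounting also includes batch/header overhead, so this is only an approximation, and the producer-side validation function suggested above does not exist today:

// Approximate local size check; overhead for record headers and batching is ignored here.
import java.util.Map;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.Serializer;

final class RecordSizeGuard {
    private final int maxRequestSize;

    RecordSizeGuard(Map<String, Object> producerConfigs) {
        this.maxRequestSize = (Integer) producerConfigs.getOrDefault(
                ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 1_048_576); // producer default is 1 MB
    }

    <K, V> boolean fitsLocally(String topic, K key, V value,
                               Serializer<K> keySerializer, Serializer<V> valueSerializer) {
        int keySize = key == null ? 0 : keySerializer.serialize(topic, key).length;
        int valueSize = value == null ? 0 : valueSerializer.serialize(topic, value).length;
        // Skip the send instead of producing and then handling the "record too large" error.
        return keySize + valueSize <= maxRequestSize;
    }
}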

-Artem

On Tue, May 7, 2024 at 2:07 PM Justine Olshan 
wrote:

> Alieh and Chris,
>
> Thanks for clarifying 1) but I saw the motivation. I guess I just didn't
> understand how that would be ensured on the producer side. I saw the sample
> code -- is it just an if statement checking for the error before the
> handler is invoked? That seems a bit fragile.
>
> Can you clarify what you mean by `since the code does not reach the KS
> interface and breaks somewhere in producer.` If we surfaced this error to
> the application in a better way would that also be a solution to the issue?
>
> Justine
>
> On Tue, May 7, 2024 at 1:55 PM Alieh Saeedi 
> wrote:
>
> > Hi,
> >
> >
> > Thank you, Chris and Justine, for the feedback.
> >
> >
> > @Chris
> >
> > 1) Flexibility: it has two meanings. The first meaning is the one you
> > mentioned. We are going to cover more exceptions in the future, but as
> > Justine mentioned, we must be very conservative about adding more
> > exceptions. Additionally, flexibility mainly means that the user is able
> to
> > develop their own code. As mentioned in the motivation section and the
> > examples, sometimes the user decides on dropping a record based on the
> > topic, for example.
> >
> >
> > 2) Defining two separate methods for retriable and non-retriable
> > exceptions: although the idea is brilliant, the user may still make a
> > mistake by implementing the wrong method and see unexpected behaviour.
> > For example, he may implement handleRetriable() for
> RecordTooLargeException
> > and define SWALLOW for the exception, but in practice, he sees no change
> in
> > default behaviour since he implemented the wrong method. I think we can
> > never reduce the user’s mistakes to 0.
> >
> >
> >
> > 3) Default implementation for Handler: the default behaviour is already
> > preserved with NO need to implement any handler or set the
> > corresponding config parameter `custom.exception.handler`. What you mean
> is
> > actually having a second default, which requires having both interface
> and
> > config parameters. About UnknownTopicOrPartitionException: the producer
> > already offers the config parameter `max.block.ms` which determines the
> > duration of retrying. The main purpose of the 

[jira] [Resolved] (KAFKA-16352) Transaction may get stuck in PrepareCommit or PrepareAbort state

2024-03-20 Thread Artem Livshits (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Livshits resolved KAFKA-16352.

Fix Version/s: 3.8.0
 Reviewer: Justine Olshan
   Resolution: Fixed

> Transaction may get stuck in PrepareCommit or PrepareAbort state
> 
>
> Key: KAFKA-16352
> URL: https://issues.apache.org/jira/browse/KAFKA-16352
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>    Reporter: Artem Livshits
>    Assignee: Artem Livshits
>Priority: Major
> Fix For: 3.8.0
>
>
> A transaction took a long time to complete; trying to restart a producer 
> would lead to CONCURRENT_TRANSACTION errors.  Investigation has shown that 
> the transaction was stuck in PrepareCommit for a few days:
> (current time when the investigation happened: Feb 27 2024), transaction 
> state:
> {{Type   |Name                  |Value}}
> {{-}}
> {{ref    |transactionalId       |xxx-yyy}}
> {{long   |producerId            |299364}}
> {{ref    |state                 |kafka.coordinator.transaction.PrepareCommit$ 
> @ 0x44fe22760}}
> {{long   |txnStartTimestamp     |1708619624810  Thu Feb 22 2024 16:33:44.810 
> GMT+}}
> {{long   |txnLastUpdateTimestamp|1708619625335  Thu Feb 22 2024 16:33:45.335 
> GMT+}}
> {{-}}
> The partition list was empty and transactionsWithPendingMarkers didn't 
> contain the reference to the transactional state.  In the log there were the 
> following relevant messages:
> {{22 Feb 2024 @ 16:33:45.623 UTC [Transaction State Manager 1]: Completed 
> loading transaction metadata from __transaction_state-3 for coordinator epoch 
> 611}}
> (this is the partition that contains the transactional id).  After the data 
> is loaded, it sends out markers, etc.
> Then there is this message:
> {{22 Feb 2024 @ 16:33:45.696 UTC [Transaction Marker Request Completion 
> Handler 4]: Transaction coordinator epoch for xxx-yyy has changed from 610 to 
> 611; cancel sending transaction markers TxnMarkerEntry{producerId=299364, 
> producerEpoch=1005, coordinatorEpoch=610, result=COMMIT, 
> partitions=[foo-bar]} to the brokers}}
> this message is logged just before the state is removed from 
> transactionsWithPendingMarkers, but the state apparently contained the entry 
> that was created by the load operation.  So the sequence of events probably 
> looked like the following:
>  # partition load completed
>  # commit markers were sent for transactional id xxx-yyy; entry in 
> transactionsWithPendingMarkers was created
>  # zombie reply from the previous epoch completed, removed entry from 
> transactionsWithPendingMarkers
>  # commit markers properly completed, but couldn't transition to 
> CommitComplete state because transactionsWithPendingMarkers didn't have the 
> proper entry, so it got stuck there until the broker was restarted
> Looking at the code there are a few cases that could lead to similar race 
> conditions.  The fix is to keep track of the PendingCompleteTxn value that 
> was used when sending the marker, so that we can only remove the state that 
> was created when the marker was sent and not accidentally remove the state 
> someone else created.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-03-08 Thread Artem Livshits
Hi Justine,

>  Are you suggesting it should be called "transaction protocol version" or
"TPV"? I don't mind that, but just wanted to clarify if we want to include
protocol or if simply "transaction version" is enough.

My understanding is that "metadata version" is the version of metadata
records, which is fairly straightforward.  "Transaction version" may be
ambiguous.

-Artem

On Thu, Feb 29, 2024 at 3:39 PM Justine Olshan 
wrote:

> Hey folks,
>
> Thanks for the discussion. Let me try to cover everyone's comments.
>
> Artem --
> I can add the examples you mentioned. As for naming, right now the feature
> is called "transaction version" or "TV". Are you suggesting it should be
> called "transaction protocol version" or "TPV"? I don't mind that, but just
> wanted to clarify if we want to include protocol or if simply "transaction
> version" is enough.
>
> Jun --
>
> 10.  *With **more features, would each of those be controlled by a separate
> feature or*
>
> *multiple features. For example, is the new transaction record format*
>
> *controlled only by MV with TV having a dependency on MV or is it
> controlled*
>
> *by both MV and TV.*
>
>
> I think this will need to be decided on a case by case basis. There should
> be a mechanism to set dependencies among features.
> For transaction version specifically, I have no metadata version
> dependencies besides requiring 3.3 to write the feature records and use the
> feature tools. I would suspect all new features would have this
> requirement.
>
>
> 11. *Basically, if **--release-version is not used, the command will just
> use the latest*
>
> *production version of every feature. Should we apply that logic to both*
>
> *tools?*
>
>
> How would this work with the upgrade tool? I think we want a way to set a
> new feature version for one feature and not touch any of the others.
>
>
> *12. Should we remove --metadata METADATA from kafka-features? It does the*
>
> *same thing as --release-version.*
>
>
> When I previously discussed with Colin McCabe offline about this tool, he
> was strongly against deprecation or changing flags. I personally think it
> could be good to unify and not support a ton of flags, but I would want to
> make sure he is aligned.
>
>
> *13. KIP-853 also extends the tools to support a new feature
> kraft.version.*
>
> *It would be useful to have alignment between that KIP and this one.*
>
>
> Sounds good. Looks like Jose is in on the discussion so we can continue
> here. :)
>
>
>
> Jose --
>
>
> *1. KIP-853 uses --feature for kafka-storage instead of --features.*
>
> *This is consistent with the use of --feature in the "kafka-feature.sh*
>
> *upgrade" command.*
>
>
> I wanted to include multiple features in one command, so it seems like
> features is a better name. I discuss more below about why I think we should
> allow setting multiple features at once.
>
>
> *2. I find it a bit inconsistent that --feature and --release-version*
>
> *are mutually exclusive in the kafka-feature CLI but not in the*
>
> *kafka-storage CLI. What is the reason for this decision?*
>
>
> For the storage tool, we are setting all the features for the cluster. By
> default, all are set. For the upgrade tool, the default is to set one
> feature. In the storage tool, it is natural for the --release-version to
> set the remaining features that --features didn't cover since otherwise we
> would need to set them all
>
> If we use the flag. In the feature upgrade case, it is less necessary for
> all the features to be set at once and the tool can be run multiple times.
> I'm not opposed to allowing both flags, but it is less necessary in my
> opinion.
>
>
> *3. KIP-853 deprecates --metadata in the kafka-features and makes it an*
>
> *alias for --release-version. In KIP-1022, what happens if the user*
>
> *specified both --metadata and --feature?*
>
>
> See my note above (Jun's comment 12) about deprecating old commands. I
> would think as the KIP stands now, we would not accept both commands.
>
>
> *4. I would suggest keeping this*
>
> *consistent with kafka-features. It would avoid having to implement one*
>
> *more parser in Kafka.*
>
>
> I sort of already implemented it as such, so I don't think it is too
> tricky. I'm not sure of an alternative. Kafka features currently only
> supports one feature at a time.
> I would like to support more than one for the storage tool. Do you have
> another suggestion for multiple features in the storage tool?
>
>
> *5. As currently described, trial and error seem to be the*
>
> *only mechanism. Should the Kafka documentation describe these*
>
> *dependencies? Is that good enough?*
>
>
> The plan so far is documentation. The idea is that this is an advanced
> feature, so I think it is reasonable to ask folks to use the documentation
>
>
> *6. Did you mean that 3.8-IV4 would map to TV2? If*
>
> *not, 3.8-IV3 would map to two different TV values.*
>
>
> It was a typo. Each MV maps to a single other feature version.
>
>
> *7. For 

[jira] [Created] (KAFKA-16352) Transaction may get stuck in PrepareCommit or PrepareAbort state

2024-03-07 Thread Artem Livshits (Jira)
Artem Livshits created KAFKA-16352:
--

 Summary: Transaction may get stuck in PrepareCommit or 
PrepareAbort state
 Key: KAFKA-16352
 URL: https://issues.apache.org/jira/browse/KAFKA-16352
 Project: Kafka
  Issue Type: Bug
  Components: core
Reporter: Artem Livshits
Assignee: Artem Livshits


A transaction took a long time to complete; trying to restart a producer would 
lead to CONCURRENT_TRANSACTION errors.  Investigation has shown that the 
transaction was stuck in PrepareCommit for a few days:

(current time when the investigation happened: Feb 27 2024), transaction state:

{{Type   |Name                  |Value}}
{{-}}
{{ref    |transactionalId       |xxx-yyy}}
{{long   |producerId            |299364}}
{{ref    |state                 |kafka.coordinator.transaction.PrepareCommit$ @ 
0x44fe22760}}
{{long   |txnStartTimestamp     |1708619624810  Thu Feb 22 2024 16:33:44.810 
GMT+}}
{{long   |txnLastUpdateTimestamp|1708619625335  Thu Feb 22 2024 16:33:45.335 
GMT+}}
{{-}}

The partition list was empty and transactionsWithPendingMarkers didn't contain 
the reference to the transactional state.  In the log there were the following 
relevant messages:

{{22 Feb 2024 @ 16:33:45.623 UTC [Transaction State Manager 1]: Completed 
loading transaction metadata from __transaction_state-3 for coordinator epoch 
611}}

(this is the partition that contains the transactional id).  After the data is 
loaded, it sends out markers, etc.

Then there is this message:

{{22 Feb 2024 @ 16:33:45.696 UTC [Transaction Marker Request Completion Handler 
4]: Transaction coordinator epoch for xxx-yyy has changed from 610 to 611; 
cancel sending transaction markers TxnMarkerEntry\{producerId=299364, 
producerEpoch=1005, coordinatorEpoch=610, result=COMMIT, partitions=[foo-bar]} 
to the brokers}}

this message is logged just before the state is removed from 
transactionsWithPendingMarkers, but the state apparently contained the entry 
that was created by the load operation.  So the sequence of events probably 
looked like the following:
 # partition load completed
 # commit markers were sent for transactional id xxx-yyy; entry in 
transactionsWithPendingMarkers was created
 # zombie reply from the previous epoch completed, removed entry from 
transactionsWithPendingMarkers
 # commit markers properly completed, but couldn't transition to CommitComplete 
state because transactionsWithPendingMarkers didn't have the proper entry, so 
it got stuck there until the broker was restarted

Looking at the code there are a few cases that could lead to similar race 
conditions.  The fix is to keep track of the PendingCompleteTxn value that was 
used when sending the marker, so that we can only remove the state that was 
created when the marker was sent and not accidentally remove the state someone 
else created.
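
As a rough illustration of that fix (the real change lives in the broker's Scala transaction coordinator code; the class and field names below are simplified stand-ins), the pattern is a value-aware removal, so a zombie completion from an older coordinator epoch cannot clear an entry created by a newer load:

// Simplified stand-in for the broker-side bookkeeping; names are illustrative only.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class PendingMarkerTracker {
    // Stand-in for the PendingCompleteTxn value mentioned above.
    record PendingCompleteTxn(long producerId, int coordinatorEpoch) { }

    private final ConcurrentMap<String, PendingCompleteTxn> transactionsWithPendingMarkers =
            new ConcurrentHashMap<>();

    void markersSent(String transactionalId, PendingCompleteTxn pending) {
        transactionsWithPendingMarkers.put(transactionalId, pending);
    }

    // Passing the exact PendingCompleteTxn that was stored when the markers were sent means
    // a stale (zombie) completion cannot remove state created by a newer load operation.
    void markersCompleted(String transactionalId, PendingCompleteTxn pendingUsedForSend) {
        transactionsWithPendingMarkers.remove(transactionalId, pendingUsedForSend);
    }
}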



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-939: Support Participation in 2PC

2024-03-01 Thread Artem Livshits
Hi Jun,

> 32. ... metric name ...

I've updated the metric name to be
*kafka.coordinator.transaction:type=TransactionStateManager,name=ActiveTransactionOpenTimeMax.*

Let me know if it works.

-Artem



On Thu, Feb 29, 2024 at 12:03 PM Artem Livshits 
wrote:

> Hi Jun,
>
> >  So, it doesn't provide the same guarantees as 2PC either.
>
> I think the key point is that we don't claim 2PC guarantees in that case.
> Maybe it's splitting hairs from the technical perspective (at the end of
> the day if the operator doesn't let the user use 2PC, it's going to be a
> "works until timeout" solution), but from user model perspective it
> provides a clear structure:
>
> - if 2PC is possible then all guarantees are in place and there is no gray
> area where we sort of provide guarantees but not fully
> - if 2PC is not possible, then it's a well-informed constraint / decision
> with well-known characteristics and the user can choose whether this is
> acceptable or not for them
>
> Maybe we can look at it from a slightly different perspective: we are not
> making a choice between allowing or not allowing using keepPreparedTxn=true
> when 2PC=false (even though that's exactly how it looks from the KIP).  In
> fact, the choice we're making is whether Flink will be able to use an
> official API when 2PC is not possible (and I think we've converged to agree
> that sometimes it won't be) or keep using a reflection hack.  In other
> words, we already have a hacky implementation for the case of
> keepPreparedTxn=true + 2PC=false, our choice is only whether we provide an
> official API for that or not.
>
> In general, if someone goes and implements a reflection-based solution
> that's an indication that there is a gap in public APIs.  And we can debate
> whether keepPreparedTxn=true + 2PC=false is the right API or not; and if we
> think it's not, then we should provide an alternative.  Right now the
> alternative is to just keep using the reflection and I think it's always
> worse than using a public API.
>
> -Artem
>
> On Wed, Feb 28, 2024 at 2:23 PM Jun Rao  wrote:
>
>> Hi, Artem,
>>
>> Thanks for the reply.
>>
>> I understand your concern on having a timeout breaking the 2PC guarantees.
>> However, the fallback plan to disable 2PC with an independent
>> keepPreparedTxn is subject to the timeout too. So, it doesn't provide the
>> same guarantees as 2PC either.
>>
>> To me, if we provide a new functionality, we should make it easy such that
>> the application developer only needs to implement it in one way, which is
>> always correct. Then, we can consider what additional things are needed to
>> make the operator comfortable enabling it.
>>
>> Jun
>>
>> On Tue, Feb 27, 2024 at 4:45 PM Artem Livshits
>>  wrote:
>>
>> > Hi Jun,
>> >
>> > Thank you for the discussion.
>> >
>> > > For 3b, it would be useful to understand the reason why an admin
>> doesn't
>> > authorize 2PC for self-hosted Flink
>> >
>> > I think the nuance here is that for cloud, there is a cloud admin
>> > (operator) and there is cluster admin (who, for example could manage
>> acls
>> > on topics or etc.).  The 2PC functionality can affect cloud operations,
>> > because a long running transaction can block the last stable offset and
>> > prevent compaction or data tiering.  In a multi-tenant environment, a
>> long
>> > running transaction that involves consumer offset may affect data that
>> is
>> > shared by multiple tenants (Flink transactions don't use consumer
>> offsets,
>> > so this is not an issue for Flink, but we'd need a separate ACL or some
>> > other way to express this permission if we wanted to go in that
>> direction).
>> >
>> > For that reason, I expect 2PC to be controlled by the cloud operator
>> and it
>> > just may not be scalable for the cloud operator to manage all potential
>> > interactions required to resolve in-doubt transactions (communicate to
>> the
>> > end users, etc.).  In general, we make no assumptions about Kafka
>> > applications -- they may come and go, they may abandon transactional ids
>> > and generate new ones.  For 2PC we need to make sure that the
>> application
>> > is highly available and wouldn't easily abandon an open transaction in
>> > Kafka.
>> >
>> > > If so, another way to address that is to allow the admin to set a
>> timeout
>> > even for the 2PC case.
>> >
>> > This effectively abandons the 2PC guarantee because it creates a case
>> for

Re: [DISCUSS] KIP-939: Support Participation in 2PC

2024-02-29 Thread Artem Livshits
Hi Jun,

>  So, it doesn't provide the same guarantees as 2PC either.

I think the key point is that we don't claim 2PC guarantees in that case.
Maybe it's splitting hairs from the technical perspective (at the end of
the day if the operator doesn't let the user use 2PC, it's going to be a
"works until timeout" solution), but from user model perspective it
provides a clear structure:

- if 2PC is possible then all guarantees are in place and there is no gray
area where we sort of provide guarantees but not fully
- if 2PC is not possible, then it's a well-informed constraint / decision
with well-known characteristics and the user can choose whether this is
acceptable or not for them

Maybe we can look at it from a slightly different perspective: we are not
making a choice between allowing or not allowing using keepPreparedTxn=true
when 2PC=false (even though that's exactly how it looks from the KIP).  In
fact, the choice we're making is whether Flink will be able to use an
official API when 2PC is not possible (and I think we've converged to agree
that sometimes it won't be) or keep using a reflection hack.  In other
words, we already have a hacky implementation for the case of
keepPreparedTxn=true + 2PC=false, our choice is only whether we provide an
official API for that or not.

In general, if someone goes and implements a reflection-based solution
that's an indication that there is a gap in public APIs.  And we can debate
whether keepPreparedTxn=true + 2PC=false is the right API or not; and if we
think it's not, then we should provide an alternative.  Right now the
alternative is to just keep using the reflection and I think it's always
worse than using a public API.

-Artem

On Wed, Feb 28, 2024 at 2:23 PM Jun Rao  wrote:

> Hi, Artem,
>
> Thanks for the reply.
>
> I understand your concern on having a timeout breaking the 2PC guarantees.
> However, the fallback plan to disable 2PC with an independent
> keepPreparedTxn is subject to the timeout too. So, it doesn't provide the
> same guarantees as 2PC either.
>
> To me, if we provide a new functionality, we should make it easy such that
> the application developer only needs to implement it in one way, which is
> always correct. Then, we can consider what additional things are needed to
> make the operator comfortable enabling it.
>
> Jun
>
> On Tue, Feb 27, 2024 at 4:45 PM Artem Livshits
>  wrote:
>
> > Hi Jun,
> >
> > Thank you for the discussion.
> >
> > > For 3b, it would be useful to understand the reason why an admin
> doesn't
> > authorize 2PC for self-hosted Flink
> >
> > I think the nuance here is that for cloud, there is a cloud admin
> > (operator) and there is cluster admin (who, for example could manage acls
> > on topics or etc.).  The 2PC functionality can affect cloud operations,
> > because a long running transaction can block the last stable offset and
> > prevent compaction or data tiering.  In a multi-tenant environment, a
> long
> > running transaction that involves consumer offset may affect data that is
> > shared by multiple tenants (Flink transactions don't use consumer
> offsets,
> > so this is not an issue for Flink, but we'd need a separate ACL or some
> > other way to express this permission if we wanted to go in that
> direction).
> >
> > For that reason, I expect 2PC to be controlled by the cloud operator and
> it
> > just may not be scalable for the cloud operator to manage all potential
> > interactions required to resolve in-doubt transactions (communicate to
> the
> > end users, etc.).  In general, we make no assumptions about Kafka
> > applications -- they may come and go, they may abandon transactional ids
> > and generate new ones.  For 2PC we need to make sure that the application
> > is highly available and wouldn't easily abandon an open transaction in
> > Kafka.
> >
> > > If so, another way to address that is to allow the admin to set a
> timeout
> > even for the 2PC case.
> >
> > This effectively abandons the 2PC guarantee because it creates a case for
> > Kafka to unilaterally make an automatic decision on a prepared
> > transaction.  I think it's fundamental for 2PC to abandon this ability
> and
> > wait for the external coordinator for the decision, after all the
> > coordinator may legitimately be unavailable for an arbitrary amount of
> > time.  Also, we already have a timeout on regular Kafka transactions,
> > having another "special" timeout could be confusing, and a large enough
> > timeout could still produce the undesirable effects for the cloud
> > operations (so we kind of get worst of both options -- we don't provide
> > guarantees and still have impact o

Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-02-28 Thread Artem Livshits
Hi Justine,

Thank you for the KIP.  I think the KIP is pretty clear and makes sense to
me.  Maybe it would be good to give a little more detail on the
implementation of feature mapping and how the tool would validate the
feature combinations.  For example, I'd expect that

bin/kafka-storage.sh format --release-version 3.6-IVI --feature
transaction.version=2

would give an error because the new transaction protocol is not supported
in 3.6.  Also, we may decide that

bin/kafka-storage.sh format --release-version 5.0-IV0 --feature
transaction.version=0

would be an unsupported combination as it'll have been a while since the
new transaction protocol has been the default and it would be too risky to
enable this combination as it may not be tested any more.
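
A validation along those lines might look roughly like the sketch below; the metadata-version bounds per transaction.version level are made-up placeholders, since the real mapping would come from the feature definitions themselves:

// Hypothetical feature-combination check; the version bounds here are placeholders.
import java.util.Map;

final class FeatureCombinationValidator {
    // transaction.version level -> {min, max} supported metadata version level (made-up values).
    private static final Map<Short, short[]> TRANSACTION_VERSION_BOUNDS = Map.of(
            (short) 0, new short[] { 0, 20 },
            (short) 2, new short[] { 15, Short.MAX_VALUE });

    static void validate(short metadataVersionLevel, short transactionVersionLevel) {
        short[] bounds = TRANSACTION_VERSION_BOUNDS.get(transactionVersionLevel);
        if (bounds == null || metadataVersionLevel < bounds[0] || metadataVersionLevel > bounds[1]) {
            throw new IllegalArgumentException("transaction.version=" + transactionVersionLevel
                    + " is not supported with the requested metadata version");
        }
    }
}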

As for the new names, I'm thinking of the "transaction feature version"
more like a "transaction protocol version" -- from the user perspective we
don't really add new functionality in KIP-890, we're changing the protocol
to be more robust (and potentially faster).

-Artem



On Wed, Feb 28, 2024 at 10:08 AM Justine Olshan
 wrote:

> Hey Andrew,
>
> Thanks for taking a look.
>
> I previously didn't include 1. We do plan to use these features immediately
> for KIP-890 and KIP-848. If we think it is easier to put the discussion in
> those KIP discussions we can, but I fear that it will easily get lost given
> the size of the KIPs.
>
> I named the features similar to how we named metadata version. Transaction
> version would control transaction features like enabling a new transaction
> record format and APIs to enable KIP-890 part 2. Likewise, the group
> coordinator version would also enable the new record formats there and the
> new group coordinator. I am open to new names or further discussion.
>
> For 2 and 3, I can provide example scripts that show the usage. I am open
> to adding --latest-stable as well.
>
> Justine
>
> On Tue, Feb 27, 2024 at 4:59 AM Andrew Schofield <
> andrew_schofield_j...@outlook.com> wrote:
>
> > Hi Justine,
> > Thanks for the KIP. This area of Kafka is complicated and making it
> easier
> > is good.
> >
> > When I use the `kafka-features.sh` tool to describe the features on my
> > cluster, I find that there’s a
> > single feature called “metadata.version”. I think this KIP does a handful
> > of things:
> >
> > 1) It introduces the idea of two new features, TV and GCV, without giving
> > them concrete names or
> > describing their behaviour.
> > 2) It introduces a new flag on the storage tool to enable advanced users
> > to control individual features
> > when they format storage for a new broker.
> > 3) It introduces a new flag on the features tool to enable a set of
> latest
> > stable features for a given
> > version to be enabled all together.
> >
> > I think that (1) probably shouldn’t be in this KIP unless there are
> > concrete details. Maybe this KIP is enabling
> > the operator experience when we introduce TV and GCV in other KIPs. I
> > don’t believe the plan is to enable
> > the new group coordinator with a feature, and it seems unnecessary to me.
> > I think it’s more compelling for TV
> > given the changes in transactions.
> >
> > For (2) and (3), it would be helpful to explicit about the syntax for the
> > enhancements to the tool. I think
> > that for the features tool, `--release-version` is an optional parameter
> > which requires a RELEASE_VERSION
> > argument. I wonder whether it would be helpful to have `--latest-stable`
> > as an option too.
> >
> > Thanks,
> > Andrew
> >
> > > On 26 Feb 2024, at 21:26, Justine Olshan  >
> > wrote:
> > >
> > > Hello folks,
> > >
> > > I'm proposing a KIP that allows for setting and upgrading new features
> > > (other than metadata version) via the kafka storage format and feature
> > > tools. This KIP extends on the feature versioning changes introduced by
> > > KIP-584 by allowing for the features to be set and upgraded.
> > >
> > > Please take a look:
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1023%3A+Formatting+and+Updating+Features
> > >
> > > Thanks,
> > >
> > > Justine
> >
> >
>


Re: [DISCUSS] KIP-939: Support Participation in 2PC

2024-02-27 Thread Artem Livshits
Hi Jun,

Thank you for the discussion.

> For 3b, it would be useful to understand the reason why an admin doesn't
authorize 2PC for self-hosted Flink

I think the nuance here is that for cloud, there is a cloud admin
(operator) and there is cluster admin (who, for example could manage acls
on topics, etc.).  The 2PC functionality can affect cloud operations,
because a long running transaction can block the last stable offset and
prevent compaction or data tiering.  In a multi-tenant environment, a long
running transaction that involves consumer offset may affect data that is
shared by multiple tenants (Flink transactions don't use consumer offsets,
so this is not an issue for Flink, but we'd need a separate ACL or some
other way to express this permission if we wanted to go in that direction).

For that reason, I expect 2PC to be controlled by the cloud operator and it
just may not be scalable for the cloud operator to manage all potential
interactions required to resolve in-doubt transactions (communicate to the
end users, etc.).  In general, we make no assumptions about Kafka
applications -- they may come and go, they may abandon transactional ids
and generate new ones.  For 2PC we need to make sure that the application
is highly available and wouldn't easily abandon an open transaction in
Kafka.

> If so, another way to address that is to allow the admin to set a timeout
even for the 2PC case.

This effectively abandons the 2PC guarantee because it creates a case for
Kafka to unilaterally make an automatic decision on a prepared
transaction.  I think it's fundamental for 2PC to abandon this ability and
wait for the external coordinator for the decision, after all the
coordinator may legitimately be unavailable for an arbitrary amount of
time.  Also, we already have a timeout on regular Kafka transactions,
having another "special" timeout could be confusing, and a large enough
timeout could still produce the undesirable effects for the cloud
operations (so we kind of get worst of both options -- we don't provide
guarantees and still have impact on operations).

-Artem

On Fri, Feb 23, 2024 at 8:55 AM Jun Rao  wrote:

> Hi, Artem,
>
> Thanks for the reply.
>
> For 3b, it would be useful to understand the reason why an admin doesn't
> authorize 2PC for self-hosted Flink. Is the main reason that 2PC has
> unbounded timeout that could lead to unbounded outstanding transactions? If
> so, another way to address that is to allow the admin to set a timeout even
> for the 2PC case. The timeout would be long enough for behaving
> applications to complete 2PC operations, but not too long for non-behaving
> applications' transactions to hang.
>
> Jun
>
> On Wed, Feb 21, 2024 at 4:34 PM Artem Livshits
>  wrote:
>
> > Hi Jun,
> >
> > > 20A. One option is to make the API initTransactions(boolean enable2PC).
> >
> > We could do that.  I think there is a little bit of symmetry between the
> > client and server that would get lost with this approach (server has
> > enable2PC as config), but I don't really see a strong reason for
> enable2PC
> > to be a config vs. an argument for initTransactions.  But let's see if we
> > find 20B to be a strong consideration for keeping a separate flag for
> > keepPreparedTxn.
> >
> > > 20B. But realistically, we want Flink (and other apps) to have a single
> > implementation
> >
> > That's correct and here's what I think can happen if we don't allow
> > independent keepPreparedTxn:
> >
> > 1. Pre-KIP-939 self-hosted Flink vs. any Kafka cluster -- reflection is
> > used, which effectively implements keepPreparedTxn=true without our
> > explicit support.
> > 2. KIP-939 self-hosted Flink vs. pre-KIP-939 Kafka cluster -- we can
> > either fall back to reflection or we just say we don't support this, have
> > to upgrade Kafka cluster first.
> > 3. KIP-939 self-hosted Flink vs. KIP-939 Kafka cluster, this becomes
> > interesting depending on whether the Kafka cluster authorizes 2PC or not:
> >  3a. Kafka cluster authorizes 2PC for self-hosted Flink -- everything uses
> > KIP-939 and there is no problem
> >  3b. Kafka cluster doesn't authorize 2PC for self-hosted Flink -- we can
> > either fallback to reflection or use keepPreparedTxn=true even if 2PC is
> > not enabled.
> >
> > It seems to be ok to not support case 2 (i.e. require Kafka upgrade
> first),
> > it shouldn't be an issue for cloud offerings as cloud providers are
> likely
> > to upgrade their Kafka to the latest versions.
> >
> > The case 3b seems to be important to support, though -- the latest
> version
> > of everything should work at least as well (and preferably better) than
> > previous ones.  It's possible to 

Re: [DISCUSS] KIP-939: Support Participation in 2PC

2024-02-21 Thread Artem Livshits
Hi Jun,

> 20A. One option is to make the API initTransactions(boolean enable2PC).

We could do that.  I think there is a little bit of symmetry between the
client and server that would get lost with this approach (server has
enable2PC as config), but I don't really see a strong reason for enable2PC
to be a config vs. an argument for initTransactions.  But let's see if we
find 20B to be a strong consideration for keeping a separate flag for
keepPreparedTxn.

> 20B. But realistically, we want Flink (and other apps) to have a single
implementation

That's correct and here's what I think can happen if we don't allow
independent keepPreparedTxn:

1. Pre-KIP-939 self-hosted Flink vs. any Kafka cluster -- reflection is
used, which effectively implements keepPreparedTxn=true without our
explicit support.
2. KIP-939 self-hosted Flink vs. pre-KIP-939 Kafka cluster -- we can
either fall back to reflection or we just say we don't support this, have
to upgrade Kafka cluster first.
3. KIP-939 self-hosted Flink vs. KIP-939 Kafka cluster, this becomes
interesting depending on whether the Kafka cluster authorizes 2PC or not:
 3a. Kafka cluster authorizes 2PC for self-hosted Flink -- everything uses
KIP-939 and there is no problem
 3b. Kafka cluster doesn't authorize 2PC for self-hosted Flink -- we can
either fallback to reflection or use keepPreparedTxn=true even if 2PC is
not enabled.

It seems to be ok to not support case 2 (i.e. require Kafka upgrade first),
it shouldn't be an issue for cloud offerings as cloud providers are likely
to upgrade their Kafka to the latest versions.

The case 3b seems to be important to support, though -- the latest version
of everything should work at least as well (and preferably better) than
previous ones.  It's possible to downgrade to case 1, but it's probably not
sustainable as newer versions of Flink would also add other features that
the customers may want to take advantage of.

If we enabled keepPreparedTxn=true even without 2PC, then we could enable
case 3b without the need to fall back to reflection, so we could get rid of
reflection-based logic and just have a single implementation based on
KIP-939.

> 32. My suggestion is to change

Let me think about it and I'll come back to this.

-Artem

On Wed, Feb 21, 2024 at 3:40 PM Jun Rao  wrote:

> Hi, Artem,
>
> Thanks for the reply.
>
> 20A. One option is to make the API initTransactions(boolean enable2PC).
> Then, it's clear from the code whether 2PC related logic should be added.
>
> 20B. But realistically, we want Flink (and other apps) to have a single
> implementation of the 2PC logic, not two different implementations, right?
>
> 32. My suggestion is to
> change
> kafka.server:type=transaction-coordinator-metrics,name=active-transaction-open-time-max
> to something like
> Metric Name: active-transaction-open-time-max
> Type:        Max
> Group:       transaction-coordinator-metrics
> Tags:        none
> Description: The max time a currently-open transaction has been open
>
> Jun
>
> On Wed, Feb 21, 2024 at 11:25 AM Artem Livshits
>  wrote:
>
> > Hi Jun,
> >
> > > 20A.  This only takes care of the abort case. The application still
> needs
> > to be changed to handle the commit case properly
> >
> > My point here is that looking at the initTransactions() call it's not
> clear
> > what the semantics is.  Say I'm doing code review, I cannot say if the
> code
> > is correct or not -- if the config (that's something that's
> > theoretically not known at the time of code review) is going to enable
> 2PC,
> > then the correct code should look one way, otherwise it would need to
> look
> > differently.  Also, say if code is written with initTransactions() without
> > explicit abort and then for whatever reason the code would get used with
> > 2PC enabled (could be a library in a bigger product) it'll start breaking
> > in a non-intuitive way.
> >
> > > 20B. Hmm, if the admin disables 2PC, there is likely a reason behind
> that
> >
> > That's true, but reality may be more complicated.  Say a user wants to
> run
> > a self-managed Flink with Confluent cloud.  Confluent cloud admin may not
> > be comfortable enabling 2PC to general user accounts that use services
> not
> > managed by Confluent (the same way Confluent doesn't allow increasing max
> > transaction timeout for general user accounts).  Right now, self-managed
> > Flink works because it uses reflection, if it moves to use public APIs
> > provided by KIP-939 it'll break.
> >
> > > 32. Ok. That's the kafka metric. In that case, the metric name has a
> > group and a name. There is no type and no package name.
> >
> > Is this a suggestion to change or confirmation that the current logic is
> > 

Re: [DISCUSS] KIP-939: Support Participation in 2PC

2024-02-21 Thread Artem Livshits
Hi Rowland,

> The Open Group DTP model and the XA interface requires that resource
managers be able to report prepared transactions only, so a prepare RPC
will be required.

It's required in the XA protocol, but I'm not sure we have to build it into
a Kafka.

Looks like we just need a catalog of prepared transactions and I wonder if
XA protocol could implement it outside of Kafka transactional state.  As an
example you can take a look at Flink that keeps track of prepared
transactions in its own storage.  I think it would be desirable if all
protocols kept their details outside of Kafka, so that Kafka stays the most
open and protocol-agnostic (and most efficient and simple) system.
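
For the sake of discussion, such an out-of-Kafka catalog could be as small as the hypothetical interface below (all names are made up; nothing like this exists in Kafka today), with the durable storage owned by the XA transaction manager rather than by the Kafka transaction coordinator:

// Hypothetical interface; the backing store would live outside Kafka (as Flink does today).
import java.util.Collection;

interface PreparedTxnCatalog {
    // Record that the Kafka transaction for this transactional id has been prepared (flushed)
    // and is awaiting the external coordinator's commit/abort decision.
    void markPrepared(String transactionalId, long producerId, short producerEpoch);

    // Remove the entry once the commit/abort decision has been applied to Kafka.
    void markCompleted(String transactionalId);

    // Report in-doubt transactions during recovery (e.g. the XA "recover" operation).
    Collection<String> listPrepared();
}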

-Artem

On Mon, Feb 19, 2024 at 12:13 PM Rowland Smith  wrote:

> Hi Artem,
>
> I think that we both have the same understanding. An explicit prepare RPC
> does not eliminate any conditions, it just reduces the window for possible
> undesirable conditions like pending in-doubt transactions. So there is no
> right or wrong answer, a prepare RPC will reduce the number of
> occurrences of in-doubt transactions, but with a performance cost of an
> extra RPC call on every transaction.
>
> The Open Group DTP model and the XA interface requires that resource
> managers be able to report prepared transactions only, so a prepare RPC
> will be required. I will include it in my KIP for XA interface support, and
> will propose an implementation where clients can choose whether they want a
> prepare RPC when not using the XA interface. How does that sound?
>
> - Rowland
>
> On Fri, Feb 16, 2024 at 7:15 PM Artem Livshits
>  wrote:
>
> > Hi Rowland,
> >
> > > I am not sure what you mean by guarantee,
> >
> > A guarantee would be an elimination of complexity or a condition.  E.g.
> if
> > adding an explicit prepare RPC eliminated in-doubt transactions, or
> > eliminated a significant complexity in implementation.
> >
> > > 1. Transactions that haven’t reached “prepared” state can be aborted
> via
> > timeout.
> >
> > The argument is that it doesn't eliminate any conditions, it merely
> reduces
> > a subset of circumstances for the conditions to happen, but the
> conditions
> > still happen and must be handled.  The operator still needs to set up
> > monitoring for run-away transactions, there still needs to be an
> > "out-of-band" channel to resolve run-away transactions (i.e. the
> operator
> > would need a way that's not a part of the 2PC protocol to reconcile with
> > the application owner), there still needs to be tooling for resolving
> > run-away transactions.
> >
> > On the downside, an explicit prepare RPC would have a performance hit on
> > the happy path in every single transaction.
> >
> > -Artem
> >
> > On Tue, Feb 6, 2024 at 7:35 PM Rowland Smith  wrote:
> >
> > > Hi Artem,
> > >
> > > I am not sure what you mean by guarantee, but I am referring to a
> better
> > > operational experience. You mentioned this as the first benefit of an
> > > explicit "prepare" RPC in the KIP.
> > >
> > >
> > > 1. Transactions that haven’t reached “prepared” state can be aborted
> via
> > > timeout.
> > >
> > > However, in explaining why an explicit "prepare" RPC was not included
> in
> > > the design, you make no further mention of this benefit. So what I am
> > > saying is this benefit is quite significant operationally. Many client
> > > application failures may occur before the transaction reaches the
> > prepared
> > > state, and the ability to automatically abort those transactions and
> > > unblock affected partitions without administrative intervention or fast
> > > restart of the client would be a worthwhile benefit. An explicit
> > "prepare"
> > > RPC will also be needed by the XA implementation, so I would like to
> see
> > it
> > > implemented for that reason. Otherwise, I will need to add this work to
> > my
> > > KIP.
> > >
> > > - Rowland
> > >
> > > On Mon, Feb 5, 2024 at 9:35 PM Artem Livshits
> > >  wrote:
> > >
> > > > Hi Rowland,
> > > >
> > > > Thank you for your reply.  I think I understand what you're saying
> and
> > > just
> > > > tried to provide a quick summary.  The
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-939%3A+Support+Participation+in+2PC#KIP939:SupportParticipationin2PC-Explicit%E2%80%9Cprepare%E2%80%9DRPC
> > >

Re: [DISCUSS] KIP-939: Support Participation in 2PC

2024-02-21 Thread Artem Livshits
Hi Jun,

> 20A.  This only takes care of the abort case. The application still needs
to be changed to handle the commit case properly

My point here is that looking at the initTransactions() call it's not clear
what the semantics is.  Say I'm doing code review, I cannot say if the code
is correct or not -- if the config (that's something that's
theoretically not known at the time of code review) is going to enable 2PC,
then the correct code should look one way, otherwise it would need to look
differently.  Also, say if code is written with initTransactions() without
explicit abort and then for whatever reason the code would get used with
2PC enabled (could be a library in a bigger product) it'll start breaking
in a non-intuitive way.

> 20B. Hmm, if the admin disables 2PC, there is likely a reason behind that

That's true, but reality may be more complicated.  Say a user wants to run
a self-managed Flink with Confluent cloud.  Confluent cloud admin may not
be comfortable enabling 2PC to general user accounts that use services not
managed by Confluent (the same way Confluent doesn't allow increasing max
transaction timeout for general user accounts).  Right now, self-managed
Flink works because it uses reflection, if it moves to use public APIs
provided by KIP-939 it'll break.

> 32. Ok. That's the kafka metric. In that case, the metric name has a
group and a name. There is no type and no package name.

Is this a suggestion to change or confirmation that the current logic is
ok?  I just copied an existing metric but can change if needed.

-Artem

On Tue, Feb 20, 2024 at 11:25 AM Jun Rao  wrote:

> Hi, Artem,
>
> Thanks for the reply.
>
> 20. "Say if an application
> currently uses initTransactions() to achieve the current semantics, it
> would need to be rewritten to use initTransactions() + abort to achieve the
> same semantics if the config is changed. "
>
> This only takes care of the abort case. The application still needs to be
> changed to handle the commit case properly
> if transaction.two.phase.commit.enable is set to true.
>
> "Even when KIP-939 is implemented,
> there would be situations when 2PC is disabled by the admin (e.g. Kafka
> service providers may be reluctant to enable 2PC for Flink services that
> users host themselves), so we either have to perpetuate the
> reflection-based implementation in Flink or enable keepPreparedTxn=true
> without 2PC."
>
> Hmm, if the admin disables 2PC, there is likely a reason behind that. I am
> not sure that we should provide an API to encourage the application to
> circumvent that.
>
> 32. Ok. That's the kafka metric. In that case, the metric name has a group
> and a name. There is no type and no package name.
>
> Jun
>
>
> On Thu, Feb 15, 2024 at 8:23 PM Artem Livshits
>  wrote:
>
> > Hi Jun,
> >
> > Thank you for your questions.
> >
> > > 20. So to abort a prepared transaction after the producer start, we
> could
> > use ...
> >
> > I agree, initTransactions(true) + abort would accomplish the behavior of
> > initTransactions(false), so we could technically have fewer ways to
> achieve
> > the same thing, which is generally valuable.  I wonder, though, if that
> > would be intuitive from the application perspective.  Say if an
> application
> > currently uses initTransactions() to achieve the current semantics, it
> > would need to be rewritten to use initTransactions() + abort to achieve
> the
> > same semantics if the config is changed.  I think this could create
> > subtle confusion, as the config change is generally decoupled from
> changing
> > application implementation.
> >
> > >  The use case mentioned for keepPreparedTxn=true without 2PC doesn't
> seem
> > very important
> >
> > I agree, it's not a strict requirement.  It is, however, a missing option
> > in the public API, so currently Flink has to use reflection to emulate
> this
> > functionality without 2PC support.   Even when KIP-939 is implemented,
> > there would be situations when 2PC is disabled by the admin (e.g. Kafka
> > service providers may be reluctant to enable 2PC for Flink services that
> > users host themselves), so we either have to perpetuate the
> > reflection-based implementation in Flink or enable keepPreparedTxn=true
> > without 2PC.
> >
> > > 32.
> >
> >
> kafka.server:type=transaction-coordinator-metrics,name=active-transaction-open-time-max
> >
> > I just followed the existing metric implementation example
> >
> >
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L95
> > ,
> > which maps to
> >

Re: [DISCUSS] KIP-939: Support Participation in 2PC

2024-02-16 Thread Artem Livshits
Hi Rowland,

> I am not sure what you mean by guarantee,

A guarantee would be an elimination of complexity or a condition.  E.g. if
adding an explicit prepare RPC eliminated in-doubt transactions, or
eliminated a significant complexity in implementation.

> 1. Transactions that haven’t reached “prepared” state can be aborted via
timeout.

The argument is that it doesn't eliminate any conditions, it merely reduces
a subset of circumstances for the conditions to happen, but the conditions
still happen and must be handled.  The operator still needs to set up
monitoring for run-away transactions, there still needs to be an
"out-of-band" channel to resolve run-away transactions (i.e. the operation
would need a way that's not a part of the 2PC protocol to reconcile with
the application owner), there still needs to be tooling for resolving
run-away transactions.

On the downside, an explicit prepare RPC would have a performance hit on
the happy path in every single transaction.

-Artem

On Tue, Feb 6, 2024 at 7:35 PM Rowland Smith  wrote:

> Hi Artem,
>
> I am not sure what you mean by guarantee, but I am referring to a better
> operational experience. You mentioned this as the first benefit of an
> explicit "prepare" RPC in the KIP.
>
>
> 1. Transactions that haven’t reached “prepared” state can be aborted via
> timeout.
>
> However, in explaining why an explicit "prepare" RPC was not included in
> the design, you make no further mention of this benefit. So what I am
> saying is this benefit is quite significant operationally. Many client
> application failures may occur before the transaction reaches the prepared
> state, and the ability to automatically abort those transactions and
> unblock affected partitions without administrative intervention or fast
> restart of the client would be a worthwhile benefit. An explicit "prepare"
> RPC will also be needed by the XA implementation, so I would like to see it
> implemented for that reason. Otherwise, I will need to add this work to my
> KIP.
>
> - Rowland
>
> On Mon, Feb 5, 2024 at 9:35 PM Artem Livshits
>  wrote:
>
> > Hi Rowland,
> >
> > Thank you for your reply.  I think I understand what you're saying and
> just
> > tried to provide a quick summary.  The
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-939%3A+Support+Participation+in+2PC#KIP939:SupportParticipationin2PC-Explicit%E2%80%9Cprepare%E2%80%9DRPC
> > actually goes into the details on what would be the benefits of adding an
> > explicit prepare RPC and why those won't really add any advantages such
> as
> > elimination the needs for monitoring, tooling or providing additional
> > guarantees.  Let me know if you think of a guarantee that prepare RPC
> would
> > provide.
> >
> > -Artem
> >
> > On Mon, Feb 5, 2024 at 6:22 PM Rowland Smith  wrote:
> >
> > > Hi Artem,
> > >
> > > I don't think that you understand what I am saying. In any transaction,
> > > there is work done before the call to prepareTransaction() and work
> done
> > > afterwards. Any work performed before the call to prepareTransaction()
> > can
> > > be aborted after a relatively short timeout if the client fails. It is
> > only
> > > after the prepareTransaction() call that a transaction becomes in-doubt
> > and
> > > must be remembered for a much longer period of time to allow the client
> > to
> > > recover and make the decision to either commit or abort. A considerable
> > > amount of time might be spent before prepareTransaction() is called,
> and
> > if
> > > the client fails in this period, relatively quick transaction abort
> would
> > > unblock any partitions and make the system fully available. So a
> prepare
> > > RPC would reduce the window where a client failure results in
> potentially
> > > long-lived blocking transactions.
> > >
> > > Here is the proposed sequence from the KIP with 2 added steps (4 and
> 5):
> > >
> > >
> > >1. Begin database transaction
> > >2. Begin Kafka transaction
> > >3. Produce data to Kafka
> > >4. Make updates to the database
> > >5. Repeat steps 3 and 4 as many times as necessary based on
> > application
> > >needs.
> > >6. Prepare Kafka transaction [currently implicit operation,
> expressed
> > as
> > >flush]
> > >7. Write produced data to the database
> > >8. Write offsets of produced data to the database
> > >9. Commit database transaction
> > >10. Commit Kafka transaction

Re: [DISCUSS] KIP-939: Support Participation in 2PC

2024-02-15 Thread Artem Livshits
Hi Jun,

Thank you for your questions.

> 20. So to abort a prepared transaction after the producer start, we could
use ...

I agree, initTransactions(true) + abort would accomplish the behavior of
initTransactions(false), so we could technically have fewer ways to achieve
the same thing, which is generally valuable.  I wonder, though, if that
would be intuitive from the application perspective.  Say an application
currently uses initTransactions() to achieve the current semantics; it
would need to be rewritten to use initTransactions() + abort to achieve the
same semantics if the config is changed.  I think this could create
subtle confusion, as the config change is generally decoupled from changes
to the application implementation.
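
To make the two call sequences concrete, here is a rough sketch using the
API names proposed in the KIP (initTransactions(boolean) does not exist in
current clients, so this is illustrative only):

  // Option A: abort whatever is pending as part of initialization.
  producer.initTransactions(false);  // keepPreparedTxn=false, aborts any ongoing txn

  // Option B: keep the prepared transaction, then abort it explicitly.
  producer.initTransactions(true);   // keepPreparedTxn=true
  producer.abortTransaction();

  // Both end in the same state; the question above is whether requiring
  // Option B after a config change is intuitive for applications that today
  // just call initTransactions().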

>  The use case mentioned for keepPreparedTxn=true without 2PC doesn't seem
very important

I agree, it's not a strict requirement.  It is, however, a missing option
in the public API, so currently Flink has to use reflection to emulate this
functionality without 2PC support.   Even when KIP-939 is implemented,
there would be situations when 2PC is disabled by the admin (e.g. Kafka
service providers may be reluctant to enable 2PC for Flink services that
users host themselves), so we either have to perpetuate the
reflection-based implementation in Flink or enable keepPreparedTxn=true
without 2PC.

> 32.
> kafka.server:type=transaction-coordinator-metrics,name=active-transaction-open-time-max

I just followed the existing metric implementation example
https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L95,
which maps to
kafka.server:type=transaction-coordinator-metrics,name=partition-load-time-max.

> 33. "If the value is 'true' then the corresponding field is set

That's correct.  Updated the KIP.

-Artem

On Wed, Feb 7, 2024 at 10:06 AM Jun Rao  wrote:

> Hi, Artem,
>
> Thanks for the reply.
>
> 20. So to abort a prepared transaction after producer start, we could use
> either
>   producer.initTransactions(false)
> or
>   producer.initTransactions(true)
>   producer.abortTransaction
> Could we just always use the latter API? If we do this, we could
> potentially eliminate the keepPreparedTxn flag in initTransactions(). After
> the initTransactions() call, the outstanding txn is always preserved if 2pc
> is enabled and aborted if 2pc is disabled. The use case mentioned for
> keepPreparedTxn=true without 2PC doesn't seem very important. If we could
> do that, it seems that we have (1) less redundant and simpler APIs; (2)
> more symmetric syntax for aborting/committing a prepared txn after producer
> restart.
>
> 32.
>
> kafka.server:type=transaction-coordinator-metrics,name=active-transaction-open-time-max
> Is this a Yammer or kafka metric? The former uses the camel case for name
> and type. The latter uses the hyphen notation, but doesn't have the type
> attribute.
>
> 33. "If the value is 'true' then the corresponding field is set in the
> InitProducerIdRequest and the KafkaProducer object is set into a state
> which only allows calling .commitTransaction or .abortTransaction."
> We should also allow .completeTransaction, right?
>
> Jun
>
>
> On Tue, Feb 6, 2024 at 3:29 PM Artem Livshits
>  wrote:
>
> > Hi Jun,
> >
> > > 20. For Flink usage, it seems that the APIs used to abort and commit a
> > prepared txn are not symmetric.
> >
> > For Flink it is expected that Flink would call .commitTransaction or
> > .abortTransaction directly, it wouldn't need to deal with
> PreparedTxnState,
> > the outcome is actually determined by the Flink's job manager, not by
> > comparison of PreparedTxnState.  So for Flink, if the Kafka sink crashes
> > and restarts there are 2 cases:
> >
> > 1. Transaction is not prepared.  In that case just call
> > producer.initTransactions(false) and then can start transactions as
> needed.
> > 2. Transaction is prepared.  In that case call
> > producer.initTransactions(true) and wait for the decision from the job
> > manager.  Note that it's not given that the transaction will get
> committed,
> > the decision could also be an abort.
> >
> >  > 21. transaction.max.timeout.ms could in theory be MAX_INT. Perhaps we
> > could use a negative timeout in the record to indicate 2PC?
> >
> > -1 sounds good, updated.
> >
> > > 30. The KIP has two different APIs to abort an ongoing txn. Do we need
> > both?
> >
> > I think of producer.initTransactions() to be an implementation for
> > adminClient.forceTerminateTransaction(transactionalId).
> >
> > > 31. "This would flush all the pending messages and transition the
> > producer
> >
> &

Re: [DISCUSS] KIP-939: Support Participation in 2PC

2024-02-06 Thread Artem Livshits
Hi Jun,

> 20. For Flink usage, it seems that the APIs used to abort and commit a
prepared txn are not symmetric.

For Flink, it is expected that Flink would call .commitTransaction or
.abortTransaction directly; it wouldn't need to deal with PreparedTxnState,
because the outcome is actually determined by Flink's job manager, not by a
comparison of PreparedTxnState.  So for Flink, if the Kafka sink crashes
and restarts, there are 2 cases:

1. Transaction is not prepared.  In that case, just call
producer.initTransactions(false) and then start transactions as needed.
2. Transaction is prepared.  In that case, call
producer.initTransactions(true) and wait for the decision from the job
manager.  Note that it's not a given that the transaction will get
committed; the decision could also be an abort.
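
As a rough sketch of the two recovery paths (initTransactions(boolean) is
the API proposed in this KIP, and the job-manager interaction below is just
a made-up placeholder for Flink's own bookkeeping):

  if (!preparedAccordingToJobManager) {
      // Case 1: nothing was prepared -- start from a clean state.
      producer.initTransactions(false);
      // ... begin new transactions as needed ...
  } else {
      // Case 2: a transaction was prepared -- keep it and wait for the decision.
      producer.initTransactions(true);
      if (jobManagerDecisionIsCommit) {
          producer.commitTransaction();
      } else {
          producer.abortTransaction();  // the decision can also be an abort
      }
  }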

 > 21. transaction.max.timeout.ms could in theory be MAX_INT. Perhaps we
could use a negative timeout in the record to indicate 2PC?

-1 sounds good, updated.

> 30. The KIP has two different APIs to abort an ongoing txn. Do we need
both?

I think of producer.initTransactions() as an implementation of
adminClient.forceTerminateTransaction(transactionalId).

> 31. "This would flush all the pending messages and transition the producer

Updated the KIP to clarify that IllegalStateException will be thrown.
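
For illustration, the clarified behavior would look roughly like this
(prepareTransaction() is the API proposed in the KIP, so the exact name is
an assumption):

  producer.beginTransaction();
  producer.send(record);
  producer.prepareTransaction();  // flushes pending messages; txn is now prepared
  // Only commitTransaction / abortTransaction / completeTransaction are
  // allowed from here on:
  producer.send(anotherRecord);   // throws IllegalStateException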

-Artem


On Mon, Feb 5, 2024 at 2:22 PM Jun Rao  wrote:

> Hi, Artem,
>
> Thanks for the reply.
>
> 20. For Flink usage, it seems that the APIs used to abort and commit a
> prepared txn are not symmetric.
> To abort, the app will just call
>   producer.initTransactions(false)
>
> To commit, the app needs to call
>   producer.initTransactions(true)
>   producer.completeTransaction(preparedTxnState)
>
> Will this be a concern? For the dual-writer usage, both abort/commit use
> the same API.
>
> 21. transaction.max.timeout.ms could in theory be MAX_INT. Perhaps we
> could
> use a negative timeout in the record to indicate 2PC?
>
> 30. The KIP has two different APIs to abort an ongoing txn. Do we need
> both?
>   producer.initTransactions(false)
>   adminClient.forceTerminateTransaction(transactionalId)
>
> 31. "This would flush all the pending messages and transition the producer
> into a mode where only .commitTransaction, .abortTransaction, or
> .completeTransaction could be called.  If the call is successful (all
> messages successfully got flushed to all partitions) the transaction is
> prepared."
>  If the producer calls send() in that state, what exception will the caller
> receive?
>
> Jun
>
>
> On Fri, Feb 2, 2024 at 3:34 PM Artem Livshits
>  wrote:
>
> > Hi Jun,
> >
> > >  Then, should we change the following in the example to use
> > InitProducerId(true) instead?
> >
> > We could. I just thought that it's good to make the example
> self-contained
> > by starting from a clean state.
> >
> > > Also, could Flink just follow the dual-write recipe?
> >
> > I think it would bring some unnecessary logic to Flink (or any other
> system
> > that already has a transaction coordinator and just wants to drive Kafka
> to
> > the desired state).  We could discuss it with Flink folks, the current
> > proposal was developed in collaboration with them.
> >
> > > 21. Could a non 2pc user explicitly set the TransactionTimeoutMs to
> > Integer.MAX_VALUE?
> >
> > The server would reject this for regular transactions, it only accepts
> > values that are <= *transaction.max.timeout.ms* (a broker config).
> >
> > > 24. Hmm, In KIP-890, without 2pc, the coordinator expects the endTxn
> > request to use the ongoing pid. ...
> >
> > Without 2PC there is no case where the pid could change between starting
> a
> > transaction and endTxn (InitProducerId would abort any ongoing
> > transaction).  With 2PC there is now a case where there could be
> > InitProducerId that can change the pid without aborting the transaction,
> so
> > we need to handle that.  I wouldn't say that the flow is different, but
> > it's rather extended to handle new cases.  The main principle is still
> the
> > same -- for all operations we use the latest "operational" pid and epoch
> > known to the client, this way we guarantee that we can fence zombie /
> split
> > brain clients by disrupting the "latest known" pid + epoch progression.
> >
> > > 25. "We send out markers using the original ongoing transaction
> > ProducerId and ProducerEpoch" ...
> >
> > Updated.
> >
> > -Artem
> >
> > On Mon, Jan 29, 2024 at 4:57 PM Jun Rao 
> wrote:
> >
> > &g

Re: [DISCUSS] KIP-939: Support Participation in 2PC

2024-02-05 Thread Artem Livshits
Hi Rowland,

Thank you for your reply.  I think I understand what you're saying and just
tried to provide a quick summary.  The
https://cwiki.apache.org/confluence/display/KAFKA/KIP-939%3A+Support+Participation+in+2PC#KIP939:SupportParticipationin2PC-Explicit%E2%80%9Cprepare%E2%80%9DRPC
actually goes into the details on what the benefits of adding an explicit
prepare RPC would be and why those wouldn't really add any advantages, such
as eliminating the need for monitoring or tooling, or providing additional
guarantees.  Let me know if you think of a guarantee that a prepare RPC
would provide.

-Artem

On Mon, Feb 5, 2024 at 6:22 PM Rowland Smith  wrote:

> Hi Artem,
>
> I don't think that you understand what I am saying. In any transaction,
> there is work done before the call to prepareTransaction() and work done
> afterwards. Any work performed before the call to prepareTransaction() can
> be aborted after a relatively short timeout if the client fails. It is only
> after the prepareTransaction() call that a transaction becomes in-doubt and
> must be remembered for a much longer period of time to allow the client to
> recover and make the decision to either commit or abort. A considerable
> amount of time might be spent before prepareTransaction() is called, and if
> the client fails in this period, relatively quick transaction abort would
> unblock any partitions and make the system fully available. So a prepare
> RPC would reduce the window where a client failure results in potentially
> long-lived blocking transactions.
>
> Here is the proposed sequence from the KIP with 2 added steps (4 and 5):
>
>
>1. Begin database transaction
>2. Begin Kafka transaction
>3. Produce data to Kafka
>4. Make updates to the database
>5. Repeat steps 3 and 4 as many times as necessary based on application
>needs.
>6. Prepare Kafka transaction [currently implicit operation, expressed as
>flush]
>7. Write produced data to the database
>8. Write offsets of produced data to the database
>9. Commit database transaction
>10. Commit Kafka transaction
>
>
> If the client application crashes before step 6, it is safe to abort the
> Kafka transaction after a relatively short timeout.
>
> I fully agree with a layered approach. However, the XA layer is going to
> require certain capabilities from the layer below it, and one of those
> capabilities is to be able to identify and report prepared transactions
> during recovery.
>
> - Rowland
>
> On Mon, Feb 5, 2024 at 12:46 AM Artem Livshits
>  wrote:
>
> > Hi Rowland,
> >
> > Thank you for your feedback.  Using an explicit prepare RPC was discussed
> > and is listed in the rejected alternatives:
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-939%3A+Support+Participation+in+2PC#KIP939:SupportParticipationin2PC-Explicit%E2%80%9Cprepare%E2%80%9DRPC
> > .
> > Basically, even if we had an explicit prepare RPC, it doesn't avoid the
> > fact that a crashed client could cause a blocking transaction.  This,
> > btw, is not just a specific property of this concrete proposal, it's a
> > fundamental trade off of any form of 2PC -- any 2PC implementation must
> > allow for infinitely "in-doubt" transactions that may not be unilaterally
> > automatically resolved within one participant.
> >
> > To mitigate the issue, using 2PC requires a special permission, so that
> the
> > Kafka admin could control that only applications that follow proper
> > standards in terms of availability (i.e. will automatically restart and
> > cleanup after a crash) would be allowed to utilize 2PC.  It is also
> assumed
> > that any practical deployment utilizing 2PC would have monitoring set up,
> > so that an operator could be alerted to investigate and manually resolve
> > in-doubt transactions (the metric and tooling support for doing so are
> also
> > described in the KIP).
> >
> > For XA support, I wonder if we could take a layered approach and store XA
> > information in a separate store, say in a compacted topic.  This way, the
> > core Kafka protocol could be decoupled from specific implementations (and
> > extra performance requirements that a specific implementation may impose)
> > and serve as a foundation for multiple implementations.
> >
> > -Artem
> >
> > On Sun, Feb 4, 2024 at 1:37 PM Rowland Smith  wrote:
> >
> > > Hi Artem,
> > >
> > > It has been a while, but I have gotten back to this. I understand that
> > when
> > > 2PC is used, the transaction timeout will be effectively infinite. I
> > don't
> > > think that this behavior is desi

Re: [DISCUSS] KIP-939: Support Participation in 2PC

2024-02-04 Thread Artem Livshits
Hi Rowland,

Thank you for your feedback.  Using an explicit prepare RPC was discussed
and is listed in the rejected alternatives:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-939%3A+Support+Participation+in+2PC#KIP939:SupportParticipationin2PC-Explicit%E2%80%9Cprepare%E2%80%9DRPC.
Basically, even if we had an explicit prepare RPC, it doesn't avoid the
fact that a crashed client could cause a blocking transaction.  This,
btw, is not just a specific property of this concrete proposal; it's a
fundamental trade-off of any form of 2PC -- any 2PC implementation must
allow for indefinitely "in-doubt" transactions that may not be unilaterally
and automatically resolved within one participant.

To mitigate the issue, using 2PC requires a special permission, so that the
Kafka admin can ensure that only applications that follow proper
availability standards (i.e. that automatically restart and clean up after
a crash) are allowed to utilize 2PC.  It is also assumed
that any practical deployment utilizing 2PC would have monitoring set up,
so that an operator could be alerted to investigate and manually resolve
in-doubt transactions (the metric and tooling support for doing so are also
described in the KIP).

For XA support, I wonder if we could take a layered approach and store XA
information in a separate store, say in a compacted topic.  This way, the
core Kafka protocol could be decoupled from specific implementations (and
extra performance requirements that a specific implementation may impose)
and serve as a foundation for multiple implementations.
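
As an illustration of the layered idea, the XA layer could keep its own
prepared-transaction map in a compacted topic that it creates itself; this
only uses the existing Admin API, and the topic name and configs below are
made up:

  // Minimal sketch; `admin` is an org.apache.kafka.clients.admin.Admin instance.
  NewTopic xaState = new NewTopic("xa-transaction-state", 1, (short) 3)
      .configs(Map.of(
          TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT,
          TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2"));
  admin.createTopics(List.of(xaState)).all().get();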

-Artem

On Sun, Feb 4, 2024 at 1:37 PM Rowland Smith  wrote:

> Hi Artem,
>
> It has been a while, but I have gotten back to this. I understand that when
> 2PC is used, the transaction timeout will be effectively infinite. I don't
> think that this behavior is desirable. A long running transaction can be
> extremely disruptive since it blocks consumers on any partitions written to
> within the pending transaction. The primary reason for a long running
> transaction is a failure of the client, or the network connecting the
> client to the broker. If such a failure occurs before the client calls
> the new prepareTransaction() method, it should be OK to abort the
> transaction after a relatively short timeout period. This approach would
> minimize the inconvenience and disruption of a long running transaction
> blocking consumers, and provide higher availability for a system using
> Kafka.
>
> In order to achieve this behavior, I think we would need a 'prepare' RPC
> call so that the server knows that a transaction has been prepared, and
> does not timeout and abort such transactions. There will be some cost to
> this extra RPC call, but there will also be a benefit of better system
> availability in case of failures.
>
> There is another reason why I would prefer this implementation. I am
> working on an XA KIP, and XA requires that Kafka brokers be able to provide
> a list of prepared transactions during recovery.  The broker can only know
> that a transaction has been prepared if an RPC call is made, so my KIP
> will need this functionality. In the XA KIP, I would like to use as much of
> the KIP-939 solution as possible, so it would be helpful if
> prepareTransactions() sent a 'prepare' RPC, and the broker recorded the
> prepared transaction state.
>
> This could be made configurable behavior if we are concerned that the cost
> of the extra RPC call is too much, and that some users would prefer to have
> speed in exchange for less system availability in some cases of client or
> network failure.
>
> Let me know what you think.
>
> -Rowland
>
> On Fri, Jan 5, 2024 at 8:03 PM Artem Livshits
>  wrote:
>
> > Hi Rowland,
> >
> > Thank you for the feedback.  For the 2PC cases, the expectation is that
> the
> > timeout on the client would be set to "effectively infinite", that would
> > exceed all practical 2PC delays.  Now I think that this flexibility is
> > confusing and can be misused, I have updated the KIP to just say that if
> > 2PC is used, the transaction never expires.
> >
> > -Artem
> >
> > On Thu, Jan 4, 2024 at 6:14 PM Rowland Smith  wrote:
> >
> > > It is probably me. I copied the original message subject into a new
> > email.
> > > Perhaps that is not enough to link them.
> > >
> > > It was not my understanding from reading KIP-939 that we are doing away
> > > with any transactional timeout in the Kafka broker. As I understand it,
> > we
> > > are allowing the application to set the transaction timeout to a value
> > that
> > > exceeds the *transaction.max.timeout.ms
> > > <http://transaction.max.timeout.ms>* setting

Re: [DISCUSS] KIP-939: Support Participation in 2PC

2024-02-02 Thread Artem Livshits
Hi Jun,

>  Then, should we change the following in the example to use
InitProducerId(true) instead?

We could. I just thought that it's good to make the example self-contained
by starting from a clean state.

> Also, could Flink just follow the dual-write recipe?

I think it would bring some unnecessary logic to Flink (or any other system
that already has a transaction coordinator and just wants to drive Kafka to
the desired state).  We could discuss it with Flink folks, the current
proposal was developed in collaboration with them.

> 21. Could a non 2pc user explicitly set the TransactionTimeoutMs to
Integer.MAX_VALUE?

The server would reject this for regular transactions, it only accepts
values that are <= *transaction.max.timeout.ms* (a broker config).

> 24. Hmm, In KIP-890, without 2pc, the coordinator expects the endTxn
request to use the ongoing pid. ...

Without 2PC there is no case where the pid could change between starting a
transaction and endTxn (InitProducerId would abort any ongoing
transaction).  With 2PC there is now a case where there could be
InitProducerId that can change the pid without aborting the transaction, so
we need to handle that.  I wouldn't say that the flow is different, but
it's rather extended to handle new cases.  The main principle is still the
same -- for all operations we use the latest "operational" pid and epoch
known to the client, this way we guarantee that we can fence zombie / split
brain clients by disrupting the "latest known" pid + epoch progression.

> 25. "We send out markers using the original ongoing transaction
ProducerId and ProducerEpoch" ...

Updated.

-Artem

On Mon, Jan 29, 2024 at 4:57 PM Jun Rao  wrote:

> Hi, Artem,
>
> Thanks for the reply.
>
> 20. So for the dual-write recipe, we should always call
> InitProducerId(keepPreparedTxn=true) from the producer? Then, should we
> change the following in the example to use InitProducerId(true) instead?
> 1. InitProducerId(false); TC STATE: Empty, ProducerId=42,
> ProducerEpoch=MAX-1, PrevProducerId=-1, NextProducerId=-1,
> NextProducerEpoch=-1; RESPONSE ProducerId=42, Epoch=MAX-1,
> OngoingTxnProducerId=-1, OngoingTxnEpoch=-1.
> Also, could Flink just follow the dual-write recipe? It's simpler if there
> is one way to solve the 2pc issue.
>
> 21. Could a non 2pc user explicitly set the TransactionTimeoutMs to
> Integer.MAX_VALUE?
>
> 24. Hmm, In KIP-890, without 2pc, the coordinator expects the endTxn
> request to use the ongoing pid. With 2pc, the coordinator now expects the
> endTxn request to use the next pid. So, the flow is different, right?
>
> 25. "We send out markers using the original ongoing transaction ProducerId
> and ProducerEpoch"
> We should use ProducerEpoch + 1 in the marker, right?
>
> Jun
>
> On Fri, Jan 26, 2024 at 8:35 PM Artem Livshits
>  wrote:
>
> > Hi Jun,
> >
> > > 20.  I am a bit confused by how we set keepPreparedTxn.  ...
> >
> > keepPreparedTxn=true informs the transaction coordinator that it should
> > keep the ongoing transaction, if any.  If the keepPreparedTxn=false, then
> > any ongoing transaction is aborted (this is exactly the current
> behavior).
> > enable2Pc is a separate argument that is controlled by the
> > *transaction.two.phase.commit.enable *setting on the client.
> >
> > To start 2PC, the client just needs to set
> > *transaction.two.phase.commit.enable*=true in the config.  Then if the
> > client knows the status of the transaction upfront (in the case of Flink,
> > Flink keeps the knowledge if the transaction is prepared in its own
> store,
> > so it always knows upfront), it can set keepPreparedTxn accordingly, then
> > if the transaction was prepared, it'll be ready for the client to
> complete
> > the appropriate action; if the client doesn't have a knowledge that the
> > transaction is prepared, keepPreparedTxn is going to be false, in which
> > case we'll get to a clean state (the same way we do today).
> >
> > For the dual-write recipe, the client doesn't know upfront if the
> > transaction is prepared, this information is implicitly encoded
> > PreparedTxnState value that can be used to resolve the transaction state.
> > In that case, keepPreparedTxn should always be true, because we don't
> know
> > upfront and we don't want to accidentally abort a committed transaction.
> >
> > The forceTerminateTransaction call can just use keepPreparedTxn=false, it
> > actually doesn't matter if it sets Enable2Pc flag.
> >
> > > 21. TransactionLogValue: Do we need some field to identify whether this
> > is written for 2PC so that ongoing txn is never auto aborted?
&g

Re: [DISCUSS] KIP-939: Support Participation in 2PC

2024-01-26 Thread Artem Livshits
produceId to commit an old txn works, but
> can be confusing. It's going to be hard for people implementing this new
> client protocol to figure out when to use the current or the new producerId
> in the EndTxnRequest. One potential way to improve this is to extend
> EndTxnRequest with a new field like expectedNextProducerId. Then we can
> always use the old produceId in the existing field, but set
> expectedNextProducerId to bypass the fencing logic when needed.
>
> Thanks,
>
> Jun
>
>
>
> On Mon, Dec 18, 2023 at 2:06 PM Artem Livshits
>  wrote:
>
> > Hi Jun,
> >
> > Thank you for the comments.
> >
> > > 10. For the two new fields in Enable2Pc and KeepPreparedTxn ...
> >
> > I added a note that all combinations are valid.  Enable2Pc=false &
> > KeepPreparedTxn=true could be potentially useful for backward
> compatibility
> > with Flink, when the new version of Flink that implements KIP-939 tries
> to
> > work with a cluster that doesn't authorize 2PC.
> >
> > > 11.  InitProducerIdResponse: If there is no ongoing txn, what will
> > OngoingTxnProducerId and OngoingTxnEpoch be set?
> >
> > I added a note that they will be set to -1.  The client then will know
> that
> > there is no ongoing txn and .completeTransaction becomes a no-op (but
> still
> > required before .send is enabled).
> >
> > > 12. ListTransactionsRequest related changes: It seems those are already
> > covered by KIP-994?
> >
> > Removed from this KIP.
> >
> > > 13. TransactionalLogValue ...
> >
> > This is now updated to work on top of KIP-890.
> >
> > > 14. "Note that the (producerId, epoch) pair that corresponds to the
> > ongoing transaction ...
> >
> > This is now updated to work on top of KIP-890.
> >
> > > 15. active-transaction-total-time-max : ...
> >
> > Updated.
> >
> > > 16. "transaction.two.phase.commit.enable The default would be ‘false’.
> > If it’s ‘false’, 2PC functionality is disabled even if the ACL is set ...
> >
> > Disabling 2PC effectively removes all authorization to use it, hence I
> > thought TRANSACTIONAL_ID_AUTHORIZATION_FAILED would be appropriate.
> >
> > Do you suggest using a different error code for 2PC authorization vs some
> > other authorization (e.g. TRANSACTIONAL_ID_2PC_AUTHORIZATION_FAILED) or a
> > different code for disabled vs. unauthorised (e.g.
> > TWO_PHASE_COMMIT_DISABLED) or both?
> >
> > > 17. completeTransaction(). We expect this to be only used during
> > recovery.
> >
> > It can also be used if, say, a commit to the database fails and the
> result
> > is inconclusive, e.g.
> >
> > 1. Begin DB transaction
> > 2. Begin Kafka transaction
> > 3. Prepare Kafka transaction
> > 4. Commit DB transaction
> > 5. The DB commit fails, figure out the state of the transaction by
> reading
> > the PreparedTxnState from DB
> > 6. Complete Kafka transaction with the PreparedTxnState.
> >
> > > 18. "either prepareTransaction was called or initTransaction(true) was
> > called": "either" should be "neither"?
> >
> > Updated.
> >
> > > 19. Since InitProducerId always bumps up the epoch, it creates a
> > situation ...
> >
> > InitProducerId only bumps the producer epoch, the ongoing transaction
> epoch
> > stays the same, no matter how many times the InitProducerId is called
> > before the transaction is completed.  Eventually the epoch may overflow,
> > and then a new producer id would be allocated, but the ongoing
> transaction
> > producer id would stay the same.
> >
> > I've added a couple examples in the KIP (
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-939%3A+Support+Participation+in+2PC#KIP939:SupportParticipationin2PC-PersistedDataFormatChanges
> > )
> > that walk through some scenarios and show how the state is changed.
> >
> > -Artem
> >
> > On Fri, Dec 8, 2023 at 6:04 PM Jun Rao  wrote:
> >
> > > Hi, Artem,
> > >
> > > Thanks for the KIP. A few comments below.
> > >
> > > 10. For the two new fields in Enable2Pc and KeepPreparedTxn in
> > > InitProducerId, it would be useful to document a bit more detail on
> what
> > > values are set under what cases. For example, are all four combinations
> > > valid?
> > >
> > > 11.  InitProducerIdResponse: If there is no ongoing txn, what will
> > > OngoingTxnProducerId and OngoingTxnEpoch be se

Re: [DISCUSS] KIP-890 Server Side Defense

2024-01-22 Thread Artem Livshits
>  Hmm -- we would fence the producer if the epoch is bumped and we get a
> lower epoch. Yes -- we are intentionally adding this to prevent fencing.

I think Jun's point is that we can defer the fencing decision until
transition into complete state (which I believe is what the current logic
is doing) -- just return CONCURRENT_TRANSACTIONS without checking the epoch
while in the prepare state.

That said, we do need to remember the next producer id somewhere in the
prepare state, because in the complete state we would need to make a
fencing decision and let the old producer in if the request is a retry of
the commit / abort operation.

An alternative could be to not reply to the client until the complete state
is written; then we don't have to generate a new producer id during the
prepare state.  But that would affect pipelining opportunities and probably
require a separate KIP to discuss the pros and cons.

-Artem

On Mon, Jan 22, 2024 at 11:34 AM Justine Olshan
 wrote:

> 101.3 I realized that I actually have two questions.
> > (1) In the non-overflow case, we need to write the previous producer Id
> tagged field in the end marker so that we know if the marker is from the new
> client. Since the end marker is derived from the prepare marker, should we
> write the previous producer Id in the prepare marker field too? Otherwise,
> we will lose this information when deriving the end marker.
>
> The "previous" producer ID is in the normal producer ID field. So yes, we
> need it in prepare and that was always the plan.
>
> Maybe it is a bit unclear so I will enumerate the fields and add them to
> the KIP if that helps.
> Say we have producer ID x and epoch y. When we overflow epoch y we get
> producer ID Z.
>
> PREPARE
> producerId: x
> previous/lastProducerId (tagged field): empty
> nextProducerId (tagged field): empty or z if y will overflow
> producerEpoch: y + 1
>
> COMPLETE
> producerId: x or z if y overflowed
> previous/lastProducerId (tagged field): x
> nextProducerId (tagged field): empty
> producerEpoch: y + 1 or 0 if we overflowed
>
> (2) In the prepare phase, if we retry and see epoch - 1 + ID in last seen
> fields and are issuing the same command (ie commit not abort), we return
> success. The logic before KIP-890 seems to return CONCURRENT_TRANSACTIONS
> in this case. Are we intentionally making this change?
>
> Hmm -- we would fence the producer if the epoch is bumped and we get a
> lower epoch. Yes -- we are intentionally adding this to prevent fencing.
>
>
> 112. We already merged the code that adds the VerifyOnly field in
> AddPartitionsToTxnRequest, which is an inter broker request. It seems that
> we didn't bump up the IBP for that. Do you know why?
>
> We no longer need IBP for all interbroker requests as ApiVersions should
> correctly gate versioning.
> We also handle unsupported version errors correctly if we receive them in
> edge cases like upgrades/downgrades.
>
> Justine
>
> On Mon, Jan 22, 2024 at 11:00 AM Jun Rao  wrote:
>
> > Hi, Justine,
> >
> > Thanks for the reply.
> >
> > 101.3 I realized that I actually have two questions.
> > (1) In the non-overflow case, we need to write the previous producer Id
> > tagged field in the end marker so that we know if the marker is from the
> new
> > client. Since the end marker is derived from the prepare marker, should we
> > write the previous producer Id in the prepare marker field too? Otherwise,
> > we will lose this information when deriving the end marker.
> > (2) In the prepare phase, if we retry and see epoch - 1 + ID in last seen
> > fields and are issuing the same command (ie commit not abort), we return
> > success. The logic before KIP-890 seems to return CONCURRENT_TRANSACTIONS
> > in this case. Are we intentionally making this change?
> >
> > 112. We already merged the code that adds the VerifyOnly field in
> > AddPartitionsToTxnRequest, which is an inter broker request. It seems
> that
> > we didn't bump up the IBP for that. Do you know why?
> >
> > Jun
> >
> > On Fri, Jan 19, 2024 at 4:50 PM Justine Olshan
> > 
> > wrote:
> >
> > > Hi Jun,
> > >
> > > 101.3 I can change "last seen" to "current producer id and epoch" if
> that
> > > was the part that was confusing
> > > 110 I can mention this
> > > 111 I can do that
> > > 112 We still need it. But I am still finalizing the design. I will
> update
> > > the KIP once I get the information finalized. Sorry for the delays.
> > >
> > > Justine
> > >
> > > On Fri, Jan 19, 2024 at 10:50 AM Jun Rao 
> > wrote:
> > >
> > > > Hi, Justine,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > 101.3 In the non-overflow case, the previous ID is the same as the
> > > produce
> > > > ID for the complete marker too, but we set the previous ID in the
> > > complete
> > > > marker. Earlier you mentioned that this is to know that the marker is
> > > > written by the new client so that we could return success on retried
> > > > endMarker requests. I was trying to understand why this is not needed
> > for
> > > > the 

Re: [DISCUSS] KIP-1014: Managing Unstable Metadata Versions in Apache Kafka

2024-01-19 Thread Artem Livshits
Hi Colin,

>  I think feature flags are somewhat orthogonal to the stable / unstable
discussion

I think feature flags can be used as an alternative to achieve results
similar to the stable / unstable functionality, as can long-lived feature
branches.  In my experience, I've seen feature flags be more successful
than feature branches for changes to existing functionality.  I also think
that stable / unstable MV would work better than feature branches.  I just
wanted to mention it for completeness; I'm not sure if we should start a
full-fledged discussion on these topics.

> I'm struggling a bit with your phrasing. Are you suggesting that unstable
MVs would not be able to be in trunk?

Unstable MV should be in trunk.  The wording is related to when we can
promote "unstable" to "stable".

-Artem


On Mon, Jan 15, 2024 at 10:03 PM Colin McCabe  wrote:

> On Fri, Jan 12, 2024, at 11:32, Artem Livshits wrote:
> > I think using feature flags (whether we support a framework and tooling
> for
> > feature flags or just an ad-hoc XyzEnabled flag) can be an alternative to
> > this KIP.  I think the value of this KIP is that it's trying to propose a
> > systemic approach for gating functionality that may take multiple
> releases
> > to develop.  A problem with ad-hoc feature flags is that it's useful
> during
> > feature development, so that people who are working on this feature (or
> are
> > interested in beta-testing the feature) can get early access (without any
> > guarantees on compatibility or even correctness); but then the feature
> > flags often stick forever after the feature development is done (and as
> > time moves on, the new code is written in such a way that it's not
> > possible to turn the feature off any more cleanly).
> >
>
> Hi Artem,
>
> I think feature flags are somewhat orthogonal to the stable / unstable
> discussion. Even if every new feature was a feature flag, you probably
> still wouldn't want to stabilize the features immediately, to avoid
> maintaining a lot of alpha stuff forever.
>
> (I also think that feature flags should be used sparingly, if at all,
> because of the way that they exponentially increase the test matrix. But
> that's a tangent, I think, given the first point...)
>
> >
> > I'd also clarify how I think about "stable".  Ismael made a comment "
> > something is stable in the "this is battle-tested" sense.".  I don't
> think
> > it has to be "battle-tested", it just has to meet the bar of being in the
> > trunk.  Again, thinking of a small single-commit feature -- to commit to
> > trunk, the feature doesn't have to be "battle-tested", but it should be
> > complete (and not just a bunch of TODOs), with tests written and some
> level
> > of dev-testing done, so that once the release is cut, we can find and fix
> > bugs and make it release-quality (as opposed to reverting the whole
> > thing).  We can apply the same "stability" bar for features to be in the
> > stable MV -- fully complete, tests written and some level of dev testing
> > done.
> >
>
> I'm struggling a bit with your phrasing. Are you suggesting that unstable
> MVs would not be able to be in trunk? I think we do want them to be able to
> go into trunk. If they have to go into a branch, then there is actually no
> need for any of this.
>
> Doing big features in long-lived branches is one genuine alternative to
> unstable MVs, I think. But it's not an alternative that works well with our
> current tooling for continuous integration, deployment, building, etc. I
> think it would also impact developer productivity somewhat negatively.
>
> best,
> Colin
>
>
> >
> > -Artem
> >
> > On Fri, Jan 12, 2024 at 10:12 AM Justine Olshan
> >  wrote:
> >
> >> Hi Ismael,
> >>
> >> I debated including something about feature flags in my last comment,
> but
> >> maybe I should have.
> >> What you said makes sense.
> >>
> >> Justine
> >>
> >> On Fri, Jan 12, 2024 at 9:31 AM Ismael Juma  wrote:
> >>
> >> > Justine,
> >> >
> >> > For features that are not production-ready, they should have an
> >> additional
> >> > configuration (not the metadata version) that enables/disables it.
> The MV
> >> > specific features we ship are something we have to support and make
> sure
> >> we
> >> > don't break going forward.
> >> >
> >> > Ismael
> >> >
> >> > On Fri, Jan 12, 2024 at 9:26 AM Justine Olshan
> >> > 
> >> > wr

Re: [DISCUSS] KIP-1014: Managing Unstable Metadata Versions in Apache Kafka

2024-01-12 Thread Artem Livshits
I think using feature flags (whether we support a framework and tooling for
feature flags or just an ad-hoc XyzEnabled flag) can be an alternative to
this KIP.  I think the value of this KIP is that it's trying to propose a
systemic approach for gating functionality that may take multiple releases
to develop.  A problem with ad-hoc feature flags is that it's useful during
feature development, so that people who are working on this feature (or are
interested in beta-testing the feature) can get early access (without any
guarantees on compatibility or even correctness); but then the feature
flags often stick forever after the feature development is done (and as
time moves one, the new code is written in such a way that it's not
possible to turn the feature off any more cleanly).

I'd also clarify how I think about "stable".  Ismael made a comment "
something is stable in the "this is battle-tested" sense.".  I don't think
it has to be "battle-tested", it just has to meet the bar of being in the
trunk.  Again, thinking of a small single-commit feature -- to commit to
trunk, the feature doesn't have to be "battle-tested", but it should be
complete (and not just a bunch of TODOs), with tests written and some level
of dev-testing done, so that once the release is cut, we can find and fix
bugs and make it release-quality (as opposed to reverting the whole
thing).  We can apply the same "stability" bar for features to be in the
stable MV -- fully complete, tests written and some level of dev testing
done.

-Artem

On Fri, Jan 12, 2024 at 10:12 AM Justine Olshan
 wrote:

> Hi Ismael,
>
> I debated including something about feature flags in my last comment, but
> maybe I should have.
> What you said makes sense.
>
> Justine
>
> On Fri, Jan 12, 2024 at 9:31 AM Ismael Juma  wrote:
>
> > Justine,
> >
> > For features that are not production-ready, they should have an
> additional
> > configuration (not the metadata version) that enables/disables it. The MV
> > specific features we ship are something we have to support and make sure
> we
> > don't break going forward.
> >
> > Ismael
> >
> > On Fri, Jan 12, 2024 at 9:26 AM Justine Olshan
> > 
> > wrote:
> >
> > > Hi Ismael,
> > >
> > > I think the concern I have about a MV for a feature that is not
> > production
> > > ready is that it blocks any development/features with higher MV
> versions
> > > that could be production ready.
> > >
> > > I do see your point though. Previously MV/IBP was about pure broker
> > > compatibility and not about the confidence in the feature it is
> gating. I
> > > do wonder though if it could be useful to have that sort of gating.
> > > I also wonder if an internal config could be useful so that the average
> > > user doesn't have to worry about the complexities of unstable metadata
> > > versions (and their risks).
> > >
> > > I am ok with options 2 and 2 as well by the way.
> > >
> > > Justine
> > >
> > > On Fri, Jan 12, 2024 at 7:36 AM Ismael Juma  wrote:
> > >
> > > > Hi,
> > > >
> > > > Thanks for the KIP.
> > > >
> > > > Reading the discussion, I think a lot of the confusion is due to the
> > > > "unstable" naming. People are then trying to figure out when we think
> > > > something is stable in the "this is battle-tested" sense. But this
> flag
> > > > should not be about that. We can have an MV for a feature that is not
> > yet
> > > > production-ready (and we did that when KRaft itself was not
> production
> > > > ready). I think this flag is about metadata versions that are
> basically
> > > > unsupported - if you use it, you get to keep the pieces. They exist
> > > solely
> > > > to make the lives of Apache Kafka developers easier. I would even
> > suggest
> > > > that the config we introduce be of the internal variety, ie it won't
> > show
> > > > in the generated documentation and there won't be any compatibility
> > > > guarantee.
> > > >
> > > > Thoughts?
> > > >
> > > > Ismael
> > > >
> > > > On Fri, Jan 5, 2024 at 7:33 AM Proven Provenzano
> > > >  wrote:
> > > >
> > > > > Hey folks,
> > > > >
> > > > > I am starting a discussion thread for managing unstable metadata
> > > versions
> > > > > in Apache Kafka.
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1014%3A+Managing+Unstable+Metadata+Versions+in+Apache+Kafka
> > > > >
> > > > > This KIP is actually already implemented in 3.7 with PR
> > > > > https://github.com/apache/kafka/pull/14860.
> > > > > I have created this KIP to explain the motivation and how managing
> > > > Metadata
> > > > > Versions is expected to work.
> > > > > Comments are greatly appreciated as this process can always be
> > > improved.
> > > > >
> > > > > --
> > > > > --Proven
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] KIP-1014: Managing Unstable Metadata Versions in Apache Kafka

2024-01-11 Thread Artem Livshits
Hi Proven,

I'd say that we should do 2 & 2.  The idea is that for small features that
can be done and stabilized within a short period of time (with one or very
few commits) that's exactly what happens -- people interested in testing an
in-progress feature could take unstable code from a patch (or a private
branch / fork) with the expectation that that private code could create a
state that will not be compatible with anything (or may be completely
broken for that matter -- at the end of the day it's functionality that
may not be fully tested or even fully implemented); and once the feature is
stable and goes to trunk, it is fully committed there, and if bugs are found
they get fixed "forward".  The 2 & 2 option pretty much extends this to
large features -- if a feature is above the stable MV, then going above it is
like getting some in-progress code for early testing, with the expectation
that something may not fully work or may not leave the system in an
upgradable state; promoting a feature into a stable MV would come with the
expectation that the feature gets fully committed and any bugs will be
fixed "forward".

-Artem

On Thu, Jan 11, 2024 at 10:16 AM Proven Provenzano
 wrote:

> We have two approaches here for how we update unstable metadata versions.
>
>1. The update will only increase MVs of unstable features to a value
>greater than the new stable feature. The idea is that a specific
> unstable
>MV may support some set of features and in the future that set is
> always a
>strict subset of the current set. The issue is that moving a feature to
>make way for a stable feature with a higher MV will leave holes.
>2. We are free to reorder the MV for any unstable feature. This removes
>the hole issue, but does make the unstable MVs more muddled. There isn't
>the same binary state for a MV where a feature is available or there is
> a
>hole.
>
>
> We also have two ends of the spectrum as to when we update the stable MV.
>
>1. We update at release points which reduces the amount of churn of the
>unstable MVs and makes a stronger correlation between accepted features
> and
>stable MVs for a release but means less testing on trunk as a stable MV.
>2. We update when the developers of a feature think it is done. This
>leads to features being available for more testing in trunk but forces
> the
>next release to include it as stable.
>
>
> I'd like more feedback from others on these two dimensions.
> --Proven
>
>
>
> On Wed, Jan 10, 2024 at 12:16 PM Justine Olshan
>  wrote:
>
> > Hmm it seems like Colin and Proven are disagreeing with whether we can
> swap
> > unstable metadata versions.
> >
> > >  When we reorder, we are always allocating a new MV and we are never
> > reusing an existing MV even if it was also unstable.
> >
> > > Given that this is true, there's no reason to have special rules about
> > what we can and can't do with unstable MVs. We can do anything
> >
> > I don't have a strong preference either way, but I think we should agree
> on
> > one approach.
> > The benefit of reordering and reusing is that we can release features
> that
> > are ready earlier and we have more flexibility. With the approach where
> we
> > always create a new MV, I am concerned with having many "empty" MVs. This
> > would encourage waiting until the release before we decide an incomplete
> > feature is not ready and moving its MV into the future. (The
> > abandoning comment I made earlier -- that is consistent with Proven's
> > approach)
> >
> > I think the only potential issue with reordering is that it could be a
> bit
> > confusing and *potentially *prone to errors. Note I say potentially
> because
> > I think it depends on folks' understanding with this new unstable
> metadata
> > version concept. I echo Federico's comments about making sure the risks
> are
> > highlighted.
> >
> > Thanks,
> >
> > Justine
> >
> > On Wed, Jan 10, 2024 at 1:16 AM Federico Valeri 
> > wrote:
> >
> > > Hi folks,
> > >
> > > > If you use an unstable MV, you probably won't be able to upgrade your
> > > software. Because whenever something changes, you'll probably get
> > > serialization exceptions being thrown inside the controller. Fatal
> ones.
> > >
> > > Thanks for this clarification. I think this concrete risk should be
> > > highlighted in the KIP and in the "unstable.metadata.versions.enable"
> > > documentation.
> > >
> > > In the test plan, should we also have one system test checking that
> > > "features with a stable MV will never have that MV changed"?
> > >
> > > On Wed, Jan 10, 2024 at 8:16 AM Colin McCabe 
> wrote:
> > > >
> > > > On Tue, Jan 9, 2024, at 18:56, Proven Provenzano wrote:
> > > > > Hi folks,
> > > > >
> > > > > Thank you for the questions.
> > > > >
> > > > > Let me clarify about reorder first. The reorder of unstable
> metadata
> > > > > versions should be infrequent.
> > > >
> > > > Why does it need to be infrequent? We should be able to reorder
> > unstable
> > > 

Re: [DISCUSS] KIP-939: Support Participation in 2PC

2024-01-05 Thread Artem Livshits
Hi Rowland,

Thank you for the feedback.  For the 2PC cases, the expectation is that the
timeout on the client would be set to "effectively infinite", that would
exceed all practical 2PC delays.  Now I think that this flexibility is
confusing and can be misused, so I have updated the KIP to just say that if
2PC is used, the transaction never expires.
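
For illustration, a 2PC participant would then be configured roughly like
this (transaction.two.phase.commit.enable is the config proposed by this
KIP; everything else is standard producer configuration):

  Properties props = new Properties();
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
  props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-2pc-app");
  // Proposed by KIP-939: with this set, the transaction never expires, so there
  // is no need to guess a transaction.timeout.ms that covers worst-case 2PC delays.
  props.put("transaction.two.phase.commit.enable", "true");
  Producer<String, String> producer =
      new KafkaProducer<>(props, new StringSerializer(), new StringSerializer());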

-Artem

On Thu, Jan 4, 2024 at 6:14 PM Rowland Smith  wrote:

> It is probably me. I copied the original message subject into a new email.
> Perhaps that is not enough to link them.
>
> It was not my understanding from reading KIP-939 that we are doing away
> with any transactional timeout in the Kafka broker. As I understand it, we
> are allowing the application to set the transaction timeout to a value that
> exceeds the *transaction.max.timeout.ms* setting
> on the broker, and having no timeout if the application does not set
> *transaction.timeout.ms* on the producer. The KIP says that the
> semantics of *transaction.timeout.ms* are
> not being changed, so I take that to mean that the broker will continue to
> enforce a timeout if provided, and abort transactions that exceed it. From
> the KIP:
>
> Client Configuration Changes
>
> *transaction.two.phase.commit.enable* The default would be ‘false’.  If set
> to ‘true’, then the broker is informed that the client is participating in
> two phase commit protocol and can set transaction timeout to values that
> exceed the *transaction.max.timeout.ms* setting on the broker (if the
> timeout is not set explicitly on the client and the two phase commit is set
> to ‘true’ then the transaction never expires).
>
> *transaction.timeout.ms* The semantics is not changed, but it can be set to
> values that exceed *transaction.max.timeout.ms* if two.phase.commit.enable
> is set to ‘true’.
>
>
> Thinking about this more I believe we would also have a possible race
> condition if the broker is unaware that a transaction has been prepared.
> The application might call prepare and get a positive response, but the
> broker might have already aborted the transaction for exceeding the
> timeout. It is a general rule of 2PC that once a transaction has been
> prepared it must be possible for it to be committed or aborted. It seems in
> this case a prepared transaction might already be aborted by the broker, so
> it would be impossible to commit.
>
> I hope this is making sense and I am not misunderstanding the KIP. Please
> let me know if I am.
>
> - Rowland
>
>
> On Thu, Jan 4, 2024 at 12:56 PM Justine Olshan
> 
> wrote:
>
> > Hey Rowland,
> >
> > Not sure why this message showed up in a different thread from the other
> > KIP-939 discussion (is it just me?)
> >
> > In KIP-939, we do away with having any transactional timeout on the Kafka
> > side. The external coordinator is fully responsible for controlling
> whether
> > the transaction completes.
> >
> > While I think there is some use in having a prepare stage, I just wanted
> to
> > clarify what the current KIP is proposing.
> >
> > Thanks,
> > Justine
> >
> > On Wed, Jan 3, 2024 at 7:49 PM Rowland Smith  wrote:
> >
> > > Hi Artem,
> > >
> > > I saw your response in the thread I started discussing Kafka
> distributed
> > > transaction support and the XA interface. I would like to work with you
> > to
> > > add XA support to Kafka on top of the excellent foundational work that
> > you
> > > have started with KIP-939. I agree that explicit XA support should not
> be
> > > included in the Kafka codebase as long as the right set of basic
> > operations
> > > are provided. I will begin pulling together a KIP to follow KIP-939.
> > >
> > > I did have one comment on KIP-939 itself. I see that you considered an
> > > explicit "prepare" RPC, but decided not to add it. If I understand your
> > > design correctly, that would mean that a 2PC transaction would have a
> > > single timeout that would need to be long enough to ensure that
> prepared
> > > transactions are not aborted when an external coordinator fails.
> However,
> > > this also means that an unprepared transaction would not be aborted
> > without
> > > waiting for the same timeout. Since long running transactions block
> > > transactional consumers, having a long timeout for all transactions
> could
> > > be disruptive. An explicit "prepare " RPC would allow the server to
> abort
> > > unprepared transactions after a relatively short timeout, and apply a
> > much
> > > longer timeout only to prepared transactions. The explicit "prepare"
> RPC
> > > would make Kafka server more resilient to client failure at the cost of
> > an
> > > extra synchronous RPC call. I think its worth reconsidering this.
> > >
> > > With an XA implementation this might become a more significant issue
> > since
> > > the transaction 

Re: [DISCUSS] Kafka distributed transaction support

2024-01-03 Thread Artem Livshits
Hi  Rowland,

KIP-939 provides a foundation for using a two-phase commit protocol with
Kafka (allows it to be a participant) that can be used to implement various
concrete protocols, such as XA but not only XA.  The benefit of supporting
a foundational construct (and not just one concrete protocol such as XA) is
that it enables other implementations of two-phase commit to work with
Kafka (e.g. the Flink sink operator) and integrate with databases that don't
support XA.

That said, XA is an important protocol that's supported by many databases
and it makes sense to have an XA implementation for Kafka.  The main
consideration, I think, is whether we should build an XA implementation into
Kafka directly (which would require a KIP) or just provide an external
utility that would implement XA based on KIP-939 (in this case it wouldn't
change Kafka and wouldn't require a KIP or anyone's approvals).  I'd be
happy to collaborate to drive an XA solution forward either way.

-Artem

On Tue, Jan 2, 2024 at 1:06 PM Justine Olshan 
wrote:

> I believe Artem also had some conversations offline about XA.
>
> If I recall correctly, he didn't plan to include it in KIP-939 but was
> happy to leave room for potential KIPs in the future.
> Please feel free to continue the conversation on the thread. :)
>
> Justine
>
> On Tue, Jan 2, 2024 at 12:05 PM Greg Harris 
> wrote:
>
> > Hi Rowland,
> >
> > First of all, welcome to the community, and thanks for thinking about
> > the future of Kafka!
> >
> > I'm not very familiar with X/Open XA, but from the documentation I
> > read, it appears most related to KIP-939: Support Participation in 2PC
> > [1] currently in-progress. You may be interested in contributing to
> > the discussion [2] for that KIP to ensure that it is easy to use
> > within an XA context. I see that someone else in that thread has
> > mentioned XA, but no conclusions appear to have been reached. I'm sure
> > that Artem would be interested to hear your use-case and vision for
> > using Kafka with XA.
> >
> > [1]:
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-939%3A+Support+Participation+in+2PC
> > [2]: https://lists.apache.org/thread/wbs9sqs3z1tdm7ptw5j4o9osmx9s41nf
> >
> > Thanks!
> > Greg
> >
> >
> > On Mon, Jan 1, 2024 at 2:20 PM Rowland Smith  wrote:
> > >
> > > Hi All,
> > >
> > > I am new to the Kafka developer community. After learning more about
> > > Kafka's transactional capabilities recently, I became interested in
> > > exploring what would be required to provide support for the XA
> interface
> > > specified in the X/ Open Distributed Processing Model in the Kafka
> > producer
> > > client. I have put together a proof of concept to satisfy my curiosity,
> > and
> > > based on that work, I think that extending the Kafka producer with XA
> > > support is doable with reasonable effort.
> > >
> > > As I understand the Kafka development team's process, the first step in
> > the
> > > process would be to produce a KIP describing the feature's goals and
> > > design. My question in this email is whether XA support has ever been
> > > considered previously by the PMC and if so, with what result. I don't
> > want
> > > to spend time working on a KIP if XA support is not something that the
> > PMC
> > > sees value in including and supporting in the Kafka codebase.
> > >
> > > Any feedback would be appreciated. I am excited to work on this feature
> > if
> > > there is interest in the community.
> > >
> > > Regards,
> > > Rowland
> >
>


Re: [DISCUSS] KIP-939: Support Participation in 2PC

2023-12-18 Thread Artem Livshits
> called": "either" should be "neither"?
>
> 19. Since InitProducerId always bumps up the epoch, it creates a situation
> where there could be multiple outstanding txns. The following is an example
> of a potential problem during recovery.
>The last txn epoch in the external store is 41 when the app dies.
>Instance1 is created for recovery.
>  1. (instance1) InitProducerId(keepPreparedTxn=true), epoch=42,
> ongoingEpoch=41
>  2. (instance1) dies before completeTxn(41) can be called.
>Instance2 is created for recovery.
>  3. (instance2) InitProducerId(keepPreparedTxn=true), epoch=43,
> ongoingEpoch=42
>  4. (instance2) completeTxn(41) => abort
>The first problem is that 41 now is aborted when it should be committed.
> The second one is that it's not clear who could abort epoch 42, which is
> still open.
>
> Jun
>
>
> On Thu, Dec 7, 2023 at 2:43 PM Justine Olshan  >
> wrote:
>
> > Hey Artem,
> >
> > Thanks for the updates. I think what you say makes sense. I just updated
> my
> > KIP so I want to reconcile some of the changes we made especially with
> > respect to the TransactionLogValue.
> >
> > Firstly, I believe tagged fields require a default value so that if they
> > are not filled, we return the default (and know that they were empty).
> For
> > my KIP, I proposed the default for producer ID tagged fields should be
> -1.
> > I was wondering if we could update the KIP to include the default values
> > for producer ID and epoch.
> >
> > Next, I noticed we decided to rename the fields. I guess that the field
> > "NextProducerId" in my KIP correlates to "ProducerId" in this KIP. Is
> that
> > correct? So we would have "TransactionProducerId" for the non-tagged
> field
> > and have "ProducerId" (NextProducerId) and "PrevProducerId" as tagged
> > fields the final version after KIP-890 and KIP-936 are implemented. Is
> this
> > correct? I think the tags will need updating, but that is trivial.
> >
> > The final question I had was with respect to storing the new epoch. In
> > KIP-890 part 2 (epoch bumps) I think we concluded that we don't need to
> > store the epoch since we can interpret the previous epoch based on the
> > producer ID. But here we could call the InitProducerId multiple times and
> > we only want the producer with the correct epoch to be able to commit the
> > transaction. Is that the correct reasoning for why we need epoch here but
> > not the Prepare/Commit state?
> >
> > Thanks,
> > Justine
> >
> > On Wed, Nov 22, 2023 at 9:48 AM Artem Livshits
> >  wrote:
> >
> > > Hi Justine,
> > >
> > > After thinking a bit about supporting atomic dual writes for Kafka +
> > NoSQL
> > > database, I came to a conclusion that we do need to bump the epoch even
> > > with InitProducerId(keepPreparedTxn=true).  As I described in my
> previous
> > > email, we wouldn't need to bump the epoch to protect from zombies so
> that
> > > reasoning is still true.  But we cannot protect from split-brain
> > scenarios
> > > when two or more instances of a producer with the same transactional id
> > try
> > > to produce at the same time.  The dual-write example for SQL databases
> (
> > > https://github.com/apache/kafka/pull/14231/files) doesn't have a
> > > split-brain problem because execution is protected by the update lock
> on
> > > the transaction state record; however NoSQL databases may not have this
> > > protection (I'll write an example for NoSQL database dual-write soon).
> > >
> > > In a nutshell, here is an example of a split-brain scenario:
> > >
> > >1. (instance1) InitProducerId(keepPreparedTxn=true), got epoch=42
> > >2. (instance2) InitProducerId(keepPreparedTxn=true), got epoch=42
> > >3. (instance1) CommitTxn, epoch bumped to 43
> > >4. (instance2) CommitTxn, this is considered a retry, so it got
> epoch
> > 43
> > >as well
> > >5. (instance1) Produce messageA w/sequence 1
> > >6. (instance2) Produce messageB w/sequence 1, this is considered a
> > >duplicate
> > >7. (instance2) Produce messageC w/sequence 2
> > >8. (instance1) Produce messageD w/sequence 2, this is considered a
> > >duplicate
> > >
> > > Now if either of those commit the transaction, it would have a mix of
> > > messages from the two instances (messageA and messageC).  With the
> proper
> > &

Re: [DISCUSS] KIP-939: Support Participation in 2PC

2023-12-18 Thread Artem Livshits
Hi Justine,

I've updated the KIP based on the KIP-890 updates.  Now KIP-939 only needs
to add one tagged field NextProducerEpoch as the other required fields will
be added as part of KIP-890.

> But here we could call the InitProducerId multiple times and we only want
the producer with the correct epoch to be able to commit the transaction

That's correct, the epoch cannot be inferred from the state in this case
because InitProducerId can be called multiple times.  I've also added an
example in the KIP that walks through the epoch overflow scenarios.

-Artem


On Thu, Dec 7, 2023 at 2:43 PM Justine Olshan 
wrote:

> Hey Artem,
>
> Thanks for the updates. I think what you say makes sense. I just updated my
> KIP so I want to reconcile some of the changes we made especially with
> respect to the TransactionLogValue.
>
> Firstly, I believe tagged fields require a default value so that if they
> are not filled, we return the default (and know that they were empty). For
> my KIP, I proposed the default for producer ID tagged fields should be -1.
> I was wondering if we could update the KIP to include the default values
> for producer ID and epoch.
>
> Next, I noticed we decided to rename the fields. I guess that the field
> "NextProducerId" in my KIP correlates to "ProducerId" in this KIP. Is that
> correct? So we would have "TransactionProducerId" for the non-tagged field
> and have "ProducerId" (NextProducerId) and "PrevProducerId" as tagged
> fields the final version after KIP-890 and KIP-936 are implemented. Is this
> correct? I think the tags will need updating, but that is trivial.
>
> The final question I had was with respect to storing the new epoch. In
> KIP-890 part 2 (epoch bumps) I think we concluded that we don't need to
> store the epoch since we can interpret the previous epoch based on the
> producer ID. But here we could call the InitProducerId multiple times and
> we only want the producer with the correct epoch to be able to commit the
> transaction. Is that the correct reasoning for why we need epoch here but
> not the Prepare/Commit state?
>
> Thanks,
> Justine
>
> On Wed, Nov 22, 2023 at 9:48 AM Artem Livshits
>  wrote:
>
> > Hi Justine,
> >
> > After thinking a bit about supporting atomic dual writes for Kafka +
> NoSQL
> > database, I came to a conclusion that we do need to bump the epoch even
> > with InitProducerId(keepPreparedTxn=true).  As I described in my previous
> > email, we wouldn't need to bump the epoch to protect from zombies so that
> > reasoning is still true.  But we cannot protect from split-brain
> scenarios
> > when two or more instances of a producer with the same transactional id
> try
> > to produce at the same time.  The dual-write example for SQL databases (
> > https://github.com/apache/kafka/pull/14231/files) doesn't have a
> > split-brain problem because execution is protected by the update lock on
> > the transaction state record; however NoSQL databases may not have this
> > protection (I'll write an example for NoSQL database dual-write soon).
> >
> > In a nutshell, here is an example of a split-brain scenario:
> >
> >1. (instance1) InitProducerId(keepPreparedTxn=true), got epoch=42
> >2. (instance2) InitProducerId(keepPreparedTxn=true), got epoch=42
> >3. (instance1) CommitTxn, epoch bumped to 43
> >4. (instance2) CommitTxn, this is considered a retry, so it got epoch
> 43
> >as well
> >5. (instance1) Produce messageA w/sequence 1
> >6. (instance2) Produce messageB w/sequence 1, this is considered a
> >duplicate
> >7. (instance2) Produce messageC w/sequence 2
> >8. (instance1) Produce messageD w/sequence 2, this is considered a
> >duplicate
> >
> > Now if either of those commit the transaction, it would have a mix of
> > messages from the two instances (messageA and messageC).  With the proper
> > epoch bump, instance1 would get fenced at step 3.
> >
> > In order to update epoch in InitProducerId(keepPreparedTxn=true) we need
> to
> > preserve the ongoing transaction's epoch (and producerId, if the epoch
> > overflows), because we'd need to make a correct decision when we compare
> > the PreparedTxnState that we read from the database with the (producerId,
> > epoch) of the ongoing transaction.
> >
> > I've updated the KIP with the following:
> >
> >- Ongoing transaction now has 2 (producerId, epoch) pairs -- one pair
> >describes the ongoing transaction, the other pair describes expected
> > epoch
> >for operations on this transactional id
> >- InitProducerI

[VOTE] KIP-939: Support Participation in 2PC

2023-12-01 Thread Artem Livshits
Hello,

This is a voting thread for
https://cwiki.apache.org/confluence/display/KAFKA/KIP-939%3A+Support+Participation+in+2PC
.

The KIP proposes extending Kafka transaction support (that already uses 2PC
under the hood) to enable atomicity of dual writes to Kafka and an external
database, and helps to fix a long standing Flink issue.

An example of code that uses the dual write recipe with JDBC and should
work for most SQL databases is here
https://github.com/apache/kafka/pull/14231.
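
For orientation, a very rough sketch of that dual-write flow is below.  It is
illustrative only: prepareTransaction() and PreparedTxnState are stand-in
names for the proposed 2PC producer API, and saveBusinessDataAndTxnState() is
a hypothetical application helper -- see the PR above for working JDBC code.

    // Sketch only; assumes db.setAutoCommit(false) and the proposed 2PC API.
    void dualWrite(KafkaProducer<String, String> producer, java.sql.Connection db,
                   ProducerRecord<String, String> record) throws Exception {
        producer.beginTransaction();
        producer.send(record);                                   // Kafka side of the write
        PreparedTxnState state = producer.prepareTransaction();  // phase 1: prepare, don't commit yet
        saveBusinessDataAndTxnState(db, state);                  // business write + prepared txn state
        db.commit();                                             //   in a single DB transaction
        producer.commitTransaction();                            // phase 2: commit the Kafka side
    }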

The FLIP for the sister fix in Flink is here
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=255071710

-Artem


Re: [DISCUSS] KIP-939: Support Participation in 2PC

2023-11-22 Thread Artem Livshits
Hi Justine,

After thinking a bit about supporting atomic dual writes for Kafka + NoSQL
database, I came to a conclusion that we do need to bump the epoch even
with InitProducerId(keepPreparedTxn=true).  As I described in my previous
email, we wouldn't need to bump the epoch to protect from zombies so that
reasoning is still true.  But we cannot protect from split-brain scenarios
when two or more instances of a producer with the same transactional id try
to produce at the same time.  The dual-write example for SQL databases (
https://github.com/apache/kafka/pull/14231/files) doesn't have a
split-brain problem because execution is protected by the update lock on
the transaction state record; however NoSQL databases may not have this
protection (I'll write an example for NoSQL database dual-write soon).

In a nutshell, here is an example of a split-brain scenario:

   1. (instance1) InitProducerId(keepPreparedTxn=true), got epoch=42
   2. (instance2) InitProducerId(keepPreparedTxn=true), got epoch=42
   3. (instance1) CommitTxn, epoch bumped to 43
   4. (instance2) CommitTxn, this is considered a retry, so it got epoch 43
   as well
   5. (instance1) Produce messageA w/sequence 1
   6. (instance2) Produce messageB w/sequence 1, this is considered a
   duplicate
   7. (instance2) Produce messageC w/sequence 2
   8. (instance1) Produce messageD w/sequence 2, this is considered a
   duplicate

Now if either of those commit the transaction, it would have a mix of
messages from the two instances (messageA and messageC).  With the proper
epoch bump, instance1 would get fenced at step 3.

In order to update epoch in InitProducerId(keepPreparedTxn=true) we need to
preserve the ongoing transaction's epoch (and producerId, if the epoch
overflows), because we'd need to make a correct decision when we compare
the PreparedTxnState that we read from the database with the (producerId,
epoch) of the ongoing transaction.
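
As a minimal sketch of that decision (illustrative only; storedProducerId /
storedEpoch stand for whatever prepared-transaction state the application
saved in its database, and ongoingProducerId / ongoingEpoch for the pair
returned by InitProducerId(keepPreparedTxn=true)):

    // Commit only if the prepared state recorded in the DB matches the ongoing
    // transaction reported by the broker; otherwise the DB transaction never
    // committed and the Kafka transaction must be aborted.
    if (storedProducerId == ongoingProducerId && storedEpoch == ongoingEpoch) {
        producer.commitTransaction();
    } else {
        producer.abortTransaction();
    }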

I've updated the KIP with the following:

   - Ongoing transaction now has 2 (producerId, epoch) pairs -- one pair
   describes the ongoing transaction, the other pair describes expected epoch
   for operations on this transactional id
   - InitProducerIdResponse now returns 2 (producerId, epoch) pairs
   - TransactionalLogValue now has 2 (producerId, epoch) pairs, the new
   values added as tagged fields, so it's easy to downgrade
   - Added a note about downgrade in the Compatibility section
   - Added a rejected alternative

-Artem

On Fri, Oct 6, 2023 at 5:16 PM Artem Livshits 
wrote:

> Hi Justine,
>
> Thank you for the questions.  Currently (pre-KIP-939) we always bump the
> epoch on InitProducerId and abort an ongoing transaction (if any).  I
> expect this behavior will continue with KIP-890 as well.
>
> With KIP-939 we need to support the case when the ongoing transaction
> needs to be preserved when keepPreparedTxn=true.  Bumping epoch without
> aborting or committing a transaction is tricky because epoch is a short
> value and it's easy to overflow.  Currently, the overflow case is handled
> by aborting the ongoing transaction, which would send out transaction
> markers with epoch=Short.MAX_VALUE to the partition leaders, which would
> fence off any messages with the producer id that started the transaction
> (they would have epoch that is less than Short.MAX_VALUE).  Then it is safe
> to allocate a new producer id and use it in new transactions.
>
> We could say that maybe when keepPreparedTxn=true we bump epoch unless it
> leads to overflow, and don't bump epoch in the overflow case.  I don't
> think it's a good solution because if it's not safe to keep the same epoch
> when keepPreparedTxn=true, then we must handle the epoch overflow case as
> well.  So either we should convince ourselves that it's safe to keep the
> epoch and do it in the general case, or we always bump the epoch and handle
> the overflow.
>
> With KIP-890, we bump the epoch on every transaction commit / abort.  This
> guarantees that even if InitProducerId(keepPreparedTxn=true) doesn't
> increment epoch on the ongoing transaction, the client will have to call
> commit or abort to finish the transaction and will increment the epoch (and
> handle epoch overflow, if needed).  If the ongoing transaction was in a bad
> state and had some zombies waiting to arrive, the abort operation would
> fence them because with KIP-890 every abort would bump the epoch.
>
> We could also look at this from the following perspective.  With KIP-890,
> zombies won't be able to cross transaction boundaries; each transaction
> completion creates a boundary and any activity in the past gets confined in
> the boundary.  Then data in any partition would look like this:
>
> 1. message1, epoch=42
> 2. message2, epoch=42
> 3. message3, epoch=42
> 4. marker (commit or abort), epoch=43
>
> Now if we inject steps 3a and 3b like

Re: [DISCUSS] KIP-994: Minor Enhancements to ListTransactions and DescribeTransactions APIs

2023-11-16 Thread Artem Livshits
Hi Raman,

I see that you've updated the KIP.  The content looks good to me.

A couple nits on the format:
- can you highlight which fields are new in the message?
- can you add your original proposal of using a tagged field in
ListTransactionsRequest to the list of rejected alternatives?

-Artem

On Tue, Nov 7, 2023 at 11:04 PM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

> Hi Artem,
> I think you make a very good point. This also looks to me like it deserves
> a version bump for the request.
>
> Andrew
>
> > On 8 Nov 2023, at 04:12, Artem Livshits 
> wrote:
> >
> > Hi Raman,
> >
> > Thank you for the KIP.  I think using the tagged field
> > in DescribeTransactionsResponse should be good -- if either the client or
> > the server doesn't support it, it's not printed, which is reasonable
> behavior.
> >
> > For the ListTransactionsRequest, though, I think using the tagged field
> > could lead to a subtle compatibility issue if a new client is used with
> old
> > server: the client could specify the DurationFilter, but the old server
> > would ignore it and list all transactions instead, which could be
> > misleading or potentially even dangerous if the results are used in a
> > script for some automation.  I think a more desirable behavior would be
> to
> > fail if the server doesn't support the new filter, which we should be
> able
> > to achieve if we bump the version of the ListTransactionsRequest and add
> > DurationFilter as a regular field.
> >
> > -Artem
> >
> > On Tue, Nov 7, 2023 at 2:20 AM Raman Verma 
> wrote:
> >
> >> I would like to start a discussion on KIP-994
> >>
> >>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-994%3A+Minor+Enhancements+to+ListTransactions+and+DescribeTransactions+APIs
> >>
>
>


Re: [DISCUSS] KIP-994: Minor Enhancements to ListTransactions and DescribeTransactions APIs

2023-11-07 Thread Artem Livshits
Hi Raman,

Thank you for the KIP.  I think using the tagged field
in DescribeTransactionsResponse should be good -- if either the client or
the server doesn't support it, it's not printed, which is reasonable behavior.

For the ListTransactionsRequest, though, I think using the tagged field
could lead to a subtle compatibility issue if a new client is used with an old
server: the client could specify the DurationFilter, but the old server
would ignore it and list all transactions instead, which could be
misleading or potentially even dangerous if the results are used in a
script for some automation.  I think a more desirable behavior would be to
fail if the server doesn't support the new filter, which we should be able
to achieve if we bump the version of the ListTransactionsRequest and add
DurationFilter as a regular field.

-Artem

On Tue, Nov 7, 2023 at 2:20 AM Raman Verma  wrote:

> I would like to start a discussion on KIP-994
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-994%3A+Minor+Enhancements+to+ListTransactions+and+DescribeTransactions+APIs
>


Re: [DISCUSS] KIP-939: Support Participation in 2PC

2023-10-06 Thread Artem Livshits
Hi Raman,

Thank you for the questions.  Given that the primary effect of setting
enable2pc flag is disabling timeout, it makes sense to make enable2pc have
similar behavior w.r.t. when it can be set.

One clarification about the Ongoing case -- the current (pre-KIP-939)
behavior is to abort ongoing transaction and let the client retry
(eventually getting into CompleteAbort state), so even though transaction
timeout is not changed when actually hitting the ongoing transaction, the
new timeout value would take effect before the call completes to the
caller.  So if we look from the caller perspective, the transaction timeout
is set whenever the InitProducerId functionality is used.

-Artem

On Wed, Oct 4, 2023 at 8:58 PM Raman Verma  wrote:

> Hello Artem,
>
> Now that `InitProducerIdRequest` will have an extra parameter (enable2PC),
> can the client change the value of this parameter during an ongoing
> transaction.
>
> Here is how the transaction coordinator responds to InitProducerId requests
> according
> to the current transaction's state.
>
> - Empty | CompleteAbort | CompleteCommit
> Bump epoch and move to Empty state. Accept any changes from incoming
> InitProducerId
> request like transactionTimeoutMs
>
> - Ongoing
> Bump epoch and move to PrepareEpochFence state. Transaction time out is not
> changed.
>
> - PrepareAbort | PrepareCommit
> No changes internally. Return Concurrent transactions error to the client.
>
> I guess we should allow the same behavior for mutating enable2PC flag
> under these conditions as for transaction timeout value.
>


Re: [DISCUSS] KIP-939: Support Participation in 2PC

2023-10-06 Thread Artem Livshits
Hi Justine,

Thank you for the questions.  Currently (pre-KIP-939) we always bump the
epoch on InitProducerId and abort an ongoing transaction (if any).  I
expect this behavior will continue with KIP-890 as well.

With KIP-939 we need to support the case when the ongoing transaction needs
to be preserved when keepPreparedTxn=true.  Bumping epoch without aborting
or committing a transaction is tricky because epoch is a short value and
it's easy to overflow.  Currently, the overflow case is handled by aborting
the ongoing transaction, which would send out transaction markers with
epoch=Short.MAX_VALUE to the partition leaders, which would fence off any
messages with the producer id that started the transaction (they would have
epoch that is less than Short.MAX_VALUE).  Then it is safe to allocate a
new producer id and use it in new transactions.
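
As a minimal sketch of that overflow handling (pseudocode only, not the actual
transaction coordinator code; writeAbortMarkers() and allocateNewProducerId()
are hypothetical helpers):

    short nextEpoch = (short) (currentEpoch + 1);
    if (nextEpoch == Short.MAX_VALUE) {
        // The max epoch is reserved for the fencing markers: abort with it so any
        // in-flight messages from the old producer id (all at lower epochs) get fenced,
        writeAbortMarkers(producerId, Short.MAX_VALUE);
        // then it is safe to move to a fresh producer id and restart the epoch.
        producerId = allocateNewProducerId();
        nextEpoch = 0;
    }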

We could say that maybe when keepPreparedTxn=true we bump epoch unless it
leads to overflow, and don't bump epoch in the overflow case.  I don't
think it's a good solution because if it's not safe to keep the same epoch
when keepPreparedTxn=true, then we must handle the epoch overflow case as
well.  So either we should convince ourselves that it's safe to keep the
epoch and do it in the general case, or we always bump the epoch and handle
the overflow.

With KIP-890, we bump the epoch on every transaction commit / abort.  This
guarantees that even if InitProducerId(keepPreparedTxn=true) doesn't
increment epoch on the ongoing transaction, the client will have to call
commit or abort to finish the transaction and will increment the epoch (and
handle epoch overflow, if needed).  If the ongoing transaction was in a bad
state and had some zombies waiting to arrive, the abort operation would
fence them because with KIP-890 every abort would bump the epoch.

We could also look at this from the following perspective.  With KIP-890,
zombies won't be able to cross transaction boundaries; each transaction
completion creates a boundary and any activity in the past gets confined in
the boundary.  Then data in any partition would look like this:

1. message1, epoch=42
2. message2, epoch=42
3. message3, epoch=42
4. marker (commit or abort), epoch=43

Now if we inject steps 3a and 3b like this:

1. message1, epoch=42
2. message2, epoch=42
3. message3, epoch=42
3a. crash
3b. InitProducerId(keepPreparedTxn=true)
4. marker (commit or abort), epoch=43

The invariant still holds even with steps 3a and 3b -- whatever activity
was in the past will get confined in the past with mandatory abort / commit
that must follow  InitProducerId(keepPreparedTxn=true).

So KIP-890 provides the proper isolation between transactions, so injecting
crash + InitProducerId(keepPreparedTxn=true) into the transaction sequence
is safe from the zombie protection perspective.

That said, I'm still thinking about it and looking for cases that might
break because we don't bump epoch when
InitProducerId(keepPreparedTxn=true), if such cases exist, we'll need to
develop the logic to handle epoch overflow for ongoing transactions.

-Artem



On Tue, Oct 3, 2023 at 10:15 AM Justine Olshan 
wrote:

> Hey Artem,
>
> Thanks for the KIP. I had a question about epoch bumping.
>
> Previously when we send an InitProducerId request on Producer startup, we
> bump the epoch and abort the transaction. Is it correct to assume that we
> will still bump the epoch, but just not abort the transaction?
> If we still bump the epoch in this case, how does this interact with
> KIP-890 where we also bump the epoch on every transaction. (I think this
> means that we may skip epochs and the data itself will all have the same
> epoch)
>
> I may have follow ups depending on the answer to this. :)
>
> Thanks,
> Justine
>
> On Thu, Sep 7, 2023 at 9:51 PM Artem Livshits
>  wrote:
>
> > Hi Alex,
> >
> > Thank you for your questions.
> >
> > > the purpose of having broker-level transaction.two.phase.commit.enable
> >
> > The thinking is that 2PC is a bit of an advanced construct so enabling
> 2PC
> > in a Kafka cluster should be an explicit decision.  If it is set to
> 'false'
> > InitProducerId (and initTransactions) would
> > return TRANSACTIONAL_ID_AUTHORIZATION_FAILED.
> >
> > > WDYT about adding an AdminClient method that returns the state of
> > transaction.two.phase.commit.enable
> >
> > I wonder if the client could just try to use 2PC and then handle the
> error
> > (e.g. if it needs to fall back to ordinary transactions).  This way it
> > could uniformly handle cases when Kafka cluster doesn't support 2PC
> > completely and cases when 2PC is restricted to certain users.  We could
> > also expose this config in describeConfigs, if the fallback approach
> > doesn't work for some scenarios.
> >
> > -Artem
>

Re: [DISCUSS] KIP-966: Eligible Leader Replicas

2023-10-03 Thread Artem Livshits
Hi Colin,

I think in your example "do_unclean_recovery" would need to do different
things depending on the strategy.

do_unclean_recovery() {
   if (unclean.recovery.manager.enabled) {
if (strategy == Aggressive)
  use UncleanRecoveryManager(waitLastKnownERL=false)  // just inspect
logs from whoever is available
else
  use  UncleanRecoveryManager(waitLastKnownERL=true)  // must wait for
at least last known ELR
  } else {
if (strategy == Aggressive)
  choose the last known leader if that is available (or a random leader
if not)
else
  wait for last known leader to get back
  }
}

The idea is that the Aggressive strategy would kick in as soon as we lost
the leader and would pick a leader from whoever is available; but the
Balanced will only kick in when ELR is empty and will wait for the brokers
that likely have most data to be available.

On Tue, Oct 3, 2023 at 3:04 PM Colin McCabe  wrote:

> On Tue, Oct 3, 2023, at 10:49, Jun Rao wrote:
> > Hi, Calvin,
> >
> > Thanks for the update KIP. A few more comments.
> >
> > 41. Why would a user choose the option to select a random replica as the
> > leader instead of using unclean.recovery.strateg=Aggressive? It seems
> that
> > the latter is strictly better? If that's not the case, could we fold this
> > option under unclean.recovery.strategy instead of introducing a separate
> > config?
>
> Hi Jun,
>
> I thought the flow of control was:
>
> If there is no leader for the partition {
>   If (there are unfenced ELR members) {
> choose_an_unfenced_ELR_member
>   } else if (there are fenced ELR members AND strategy=Aggressive) {
> do_unclean_recovery
>   } else if (there are no ELR members AND strategy != None) {
> do_unclean_recovery
>   } else {
> do nothing about the missing leader
>   }
> }
>
> do_unclean_recovery() {
>if (unclean.recovery.manager.enabled) {
> use UncleanRecoveryManager
>   } else {
> choose the last known leader if that is available (or a random leader
> if not)
>   }
> }
>
> However, I think this could be clarified, especially the behavior when
> unclean.recovery.manager.enabled=false. Intuitively the goal for
> unclean.recovery.manager.enabled=false is to be "the same as now, mostly"
> but it's very underspecified in the KIP, I agree.
>
> >
> > 50. ElectLeadersRequest: "If more than 20 topics are included, only the
> > first 20 will be served. Others will be returned with DesiredLeaders."
> Hmm,
> > not sure that I understand this. ElectLeadersResponse doesn't have a
> > DesiredLeaders field.
> >
> > 51. GetReplicaLogInfo: "If more than 2000 partitions are included, only
> the
> > first 2000 will be served" Do we return an error for the remaining
> > partitions? Actually, should we include an errorCode field at the
> partition
> > level in GetReplicaLogInfoResponse to cover non-existing partitions and
> no
> > authorization, etc?
> >
> > 52. The entry should matches => The entry should match
> >
> > 53. ElectLeadersRequest.DesiredLeaders: Should it be nullable since a
> user
> > may not specify DesiredLeaders?
> >
> > 54. Downgrade: Is that indeed possible? I thought earlier you said that
> > once the new version of the records are in the metadata log, one can't
> > downgrade since the old broker doesn't know how to parse the new version
> of
> > the metadata records?
> >
>
> MetadataVersion downgrade is currently broken but we have fixing it on our
> plate for Kafka 3.7.
>
> The way downgrade works is that "new features" are dropped, leaving only
> the old ones.
>
> > 55. CleanShutdownFile: Should we add a version field for future
> extension?
> >
> > 56. Config changes are public facing. Could we have a separate section to
> > document all the config changes?
>
> +1. A separate section for this would be good.
>
> best,
> Colin
>
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Sep 25, 2023 at 4:29 PM Calvin Liu 
> > wrote:
> >
> >> Hi Jun
> >> Thanks for the comments.
> >>
> >> 40. If we change the default to None, it does not guarantee no data loss. For
> users
> >> who are not able to validate the data with external resources, manual
> >> intervention does not give a better result but a loss of availability.
> So
> >> practically speaking, the Balance mode would be a better default value.
> >>
> >> 41. No, it represents how we want to do the unclean leader election. If
> it
> >> is false, the unclean leader election will be the old random way.
> >> Otherwise, the unclean recovery will be used.
> >>
> >> 42. Good catch. Updated.
> >>
> >> 43. Only the first 20 topics will be served. Others will be returned
> with
> >> InvalidRequestError
> >>
> >> 44. The order matters. The desired leader entries match with the topic
> >> partition list by the index.
> >>
> >> 45. Thanks! Updated.
> >>
> >> 46. Good advice! Updated.
> >>
> >> 47.1, updated the comment. Basically it will elect the replica in the
> >> desiredLeader field to be the leader
> >>
> >> 47.2 We can let the admin client do the 

Re: [DISCUSS] KIP-939: Support Participation in 2PC

2023-09-07 Thread Artem Livshits
Hi Alex,

Thank you for your questions.

> the purpose of having broker-level transaction.two.phase.commit.enable

The thinking is that 2PC is a bit of an advanced construct so enabling 2PC
in a Kafka cluster should be an explicit decision.  If it is set to 'false'
InitProducerId (and initTransactions) would
return TRANSACTIONAL_ID_AUTHORIZATION_FAILED.

> WDYT about adding an AdminClient method that returns the state of
transaction.two.phase.commit.enable

I wonder if the client could just try to use 2PC and then handle the error
(e.g. if it needs to fall back to ordinary transactions).  This way it
could uniformly handle cases when Kafka cluster doesn't support 2PC
completely and cases when 2PC is restricted to certain users.  We could
also expose this config in describeConfigs, if the fallback approach
doesn't work for some scenarios.
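
A sketch of that fallback from the client side -- transaction.two.phase.commit.enable
is the producer config proposed in this KIP, and the exact exception surfaced by
initTransactions() is an assumption here, so treat this as illustrative:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.errors.TransactionalIdAuthorizationException;
    import org.apache.kafka.common.serialization.StringSerializer;

    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "dual-write-app-1");  // example id
    props.put("transaction.two.phase.commit.enable", "true");               // proposed config

    KafkaProducer<String, String> producer = new KafkaProducer<>(props);
    try {
        producer.initTransactions();
    } catch (TransactionalIdAuthorizationException e) {
        // 2PC is disabled on the cluster or not permitted for this principal:
        // close this producer and recreate it without the 2PC config, falling
        // back to ordinary transactions.
        producer.close();
    }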

-Artem


On Tue, Sep 5, 2023 at 12:45 PM Alexander Sorokoumov
 wrote:

> Hi Artem,
>
> Thanks for publishing this KIP!
>
> Can you please clarify the purpose of having broker-level
> transaction.two.phase.commit.enable config in addition to the new ACL? If
> the brokers are configured with transaction.two.phase.commit.enable=false,
> at what point will a client configured with
> transaction.two.phase.commit.enable=true fail? Will it happen at
> KafkaProducer#initTransactions?
>
> WDYT about adding an AdminClient method that returns the state of
> transaction.two.phase.commit.enable? This way, clients would know in advance
> if 2PC is enabled on the brokers.
>
> Best,
> Alex
>
> On Fri, Aug 25, 2023 at 9:40 AM Roger Hoover 
> wrote:
>
> > Other than supporting multiplexing transactional streams on a single
> > producer, I don't see how to improve it.
> >
> > On Thu, Aug 24, 2023 at 12:12 PM Artem Livshits
> >  wrote:
> >
> > > Hi Roger,
> > >
> > > Thank you for summarizing the cons.  I agree and I'm curious what would
> > be
> > > the alternatives to solve these problems better and if they can be
> > > incorporated into this proposal (or built independently in addition to
> or
> > > on top of this proposal).  E.g. one potential extension we discussed
> > > earlier in the thread could be multiplexing logical transactional
> > "streams"
> > > with a single producer.
> > >
> > > -Artem
> > >
> > > On Wed, Aug 23, 2023 at 4:50 PM Roger Hoover 
> > > wrote:
> > >
> > > > Thanks.  I like that you're moving Kafka toward supporting this
> > > dual-write
> > > > pattern.  Each use case needs to consider the tradeoffs.  You already
> > > > summarized the pros very well in the KIP.  I would summarize the cons
> > > > as follows:
> > > >
> > > > - you sacrifice availability - each write requires both DB and Kafka
> to
> > > be
> > > > available so I think your overall application availability is 1 -
> p(DB
> > is
> > > > unavailable)*p(Kafka is unavailable).
> > > > - latency will be higher and throughput lower - each write requires
> > both
> > > > writes to DB and Kafka while holding an exclusive lock in DB.
> > > > - you need to create a producer per unit of concurrency in your app
> > which
> > > > has some overhead in the app and Kafka side (number of connections,
> > poor
> > > > batching).  I assume the producers would need to be configured for
> low
> > > > latency (linger.ms=0)
> > > > - there's some complexity in managing stable transactional ids for
> each
> > > > producer/concurrency unit in your application.  With k8s deployment,
> > you
> > > > may need to switch to something like a StatefulSet that gives each
> pod
> > a
> > > > stable identity across restarts.  On top of that pod identity which
> you
> > > can
> > > > use as a prefix, you then assign unique transactional ids to each
> > > > concurrency unit (thread/goroutine).
> > > >
> > > > On Wed, Aug 23, 2023 at 12:53 PM Artem Livshits
> > > >  wrote:
> > > >
> > > > > Hi Roger,
> > > > >
> > > > > Thank you for the feedback.  You make a very good point that we
> also
> > > > > discussed internally.  Adding support for multiple concurrent
> > > > > transactions in one producer could be valuable but it seems to be a
> > > > fairly
> > > > > large and independent change that would deserve a separate KIP.  If
> > > such
> > > > > support is added we 

Re: [DISCUSS] KIP-966: Eligible Leader Replicas

2023-09-07 Thread Artem Livshits
Hi Calvin,

Thanks for the KIP.  The new ELR protocol looks good to me.  I have some
questions about unclean recovery, specifically in "balanced" mode:

1. The KIP mentions that the controller would trigger unclean recovery when
the leader is fenced, but my understanding is that when a leader is fenced,
it would get into ELR.  Would it be more precise to say that an unclean
leader election is triggered when the last member of ELR gets unfenced and
registers with unclean shutdown?
2. For balanced mode, we need replies from at least LastKnownELR, in which
case, does it make sense to start unclean recovery if some of the
LastKnownELR are fenced?
3. "The URM takes the partition info to initiate an unclear recovery task
..." the parameters are topic-partition and replica ids -- what are those?
Would those be just the whole replica assignment or just LastKnownELR?

-Artem

On Thu, Aug 10, 2023 at 3:47 PM Calvin Liu 
wrote:

> Hi everyone,
> I'd like to discuss a series of enhancement to the replication protocol.
>
> A partition replica can experience local data loss in unclean shutdown
> scenarios where unflushed data in the OS page cache is lost - such as an
> availability zone power outage or a server error. The Kafka replication
> protocol is designed to handle these situations by removing such replicas
> from the ISR and only re-adding them once they have caught up and therefore
> recovered any lost data. This prevents replicas that lost an arbitrary log
> suffix, which included committed data, from being elected leader.
> However, there is a "last replica standing" state which when combined with
> a data loss unclean shutdown event can turn a local data loss scenario into
> a global data loss scenario, i.e., committed data can be removed from all
> replicas. When the last replica in the ISR experiences an unclean shutdown
> and loses committed data, it will be reelected leader after starting up
> again, causing rejoining followers to truncate their logs and thereby
> removing the last copies of the committed records which the leader lost
> initially.
>
> The new KIP will maximize the protection and provides MinISR-1 tolerance to
> data loss unclean shutdown events.
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas
>


Re: [DISCUSS] KIP-939: Support Participation in 2PC

2023-08-24 Thread Artem Livshits
Hi Roger,

Thank you for summarizing the cons.  I agree and I'm curious what would be
the alternatives to solve these problems better and if they can be
incorporated into this proposal (or built independently in addition to or
on top of this proposal).  E.g. one potential extension we discussed
earlier in the thread could be multiplexing logical transactional "streams"
with a single producer.

-Artem

On Wed, Aug 23, 2023 at 4:50 PM Roger Hoover  wrote:

> Thanks.  I like that you're moving Kafka toward supporting this dual-write
> pattern.  Each use case needs to consider the tradeoffs.  You already
> summarized the pros very well in the KIP.  I would summarize the cons
> as follows:
>
> - you sacrifice availability - each write requires both DB and Kafka to be
> available so I think your overall application availability is (1 - p(DB is
> unavailable)) * (1 - p(Kafka is unavailable)).
> - latency will be higher and throughput lower - each write requires both
> writes to DB and Kafka while holding an exclusive lock in DB.
> - you need to create a producer per unit of concurrency in your app which
> has some overhead in the app and Kafka side (number of connections, poor
> batching).  I assume the producers would need to be configured for low
> latency (linger.ms=0)
> - there's some complexity in managing stable transactional ids for each
> producer/concurrency unit in your application.  With k8s deployment, you
> may need to switch to something like a StatefulSet that gives each pod a
> stable identity across restarts.  On top of that pod identity which you can
> use as a prefix, you then assign unique transactional ids to each
> concurrency unit (thread/goroutine).
>
> On Wed, Aug 23, 2023 at 12:53 PM Artem Livshits
>  wrote:
>
> > Hi Roger,
> >
> > Thank you for the feedback.  You make a very good point that we also
> > discussed internally.  Adding support for multiple concurrent
> > transactions in one producer could be valuable but it seems to be a
> fairly
> > large and independent change that would deserve a separate KIP.  If such
> > support is added we could modify 2PC functionality to incorporate that.
> >
> > > Maybe not too bad but a bit of pain to manage these ids inside each
> > process and across all application processes.
> >
> > I'm not sure if supporting multiple transactions in one producer would
> make
> > id management simpler: we'd need to store a piece of data per
> transaction,
> > so whether it's N producers with a single transaction or N transactions
> > with a single producer, it's still roughly the same amount of data to
> > manage.  In fact, managing transactional ids (current proposal) might be
> > easier, because the id is controlled by the application and it knows how
> to
> > complete the transaction after crash / restart; while a TID would be
> > generated by Kafka, which would create a question of starting a Kafka
> > transaction but not saving its TID, then crashing, then figuring out
> > which transactions to abort, etc.
> >
> > > 2) creating a separate producer for each concurrency slot in the
> > application
> >
> > This is a very valid concern.  Maybe we'd need to have some multiplexing
> of
> > transactional logical "streams" over the same connection.  Seems like a
> > separate KIP, though.
> >
> > > Otherwise, it seems you're left with single-threaded model per
> > application process?
> >
> > That's a fair assessment.  Not necessarily exactly single-threaded per
> > application, but a single producer per thread model (i.e. an application
> > could have a pool of threads + producers to increase concurrency).
> >
> > -Artem
> >
> > On Tue, Aug 22, 2023 at 7:22 PM Roger Hoover 
> > wrote:
> >
> > > Artem,
> > >
> > > Thanks for the reply.
> > >
> > > If I understand correctly, Kafka does not support concurrent
> transactions
> > > from the same producer (transactional id).  I think this means that
> > > applications that want to support in-process concurrency (say
> > thread-level
> > > concurrency with row-level DB locking) would need to manage separate
> > > transactional ids and producers per thread and then store txn state
> > > accordingly.   The potential usability downsides I see are
> > > 1) managing a set of transactional ids for each application process
> that
> > > scales up to it's max concurrency.  Maybe not too bad but a bit of pain
> > to
> > > manage these ids inside each process and across all application
> > processes.
> > > 2) crea

Re: [DISCUSS] KIP-939: Support Participation in 2PC

2023-08-24 Thread Artem Livshits
Hi Guy,

You raise a very good point.  Supporting XA sounds like a good way to
integrate Kafka and it's something that I think we should support at
some point in the future.  For this KIP, though, we thought we'd focus on
more basic functionality, keeping the following in mind:

1. XA is not universally supported and it would be good to integrate with
systems that just have local transactions support (which would include
systems that are not "proper" databases, like Zookeeper, RocksDB, etc.).
2. More advanced functionality, like XA, we should be able to implement on
top of KIP-939 as a library or a recipe.

I would like to hear your thoughts on #2 specifically -- do you think that
we actually need to amend KIP-939 to enable XA in the future, or could the
XA changes be done incrementally on top of KIP-939?

-Artem

On Wed, Aug 23, 2023 at 5:23 AM  wrote:

> Hi,
>
> Nice idea, but you could maximise compatibility if you adhere to XA
> standard APIs rather than Kafka internal APIs.
>
> We at Atomikos offer 2PC coordination and recovery and we are happy to
> help you design this, it's a service we usually offer for free to backend
> vendors / systems.
>
> Let me know if you'd like to explore?
>
> Guy
>
>
> On 2023/08/17 06:39:57 Artem Livshits wrote:
> > Hello,
> >
> >  This is a discussion thread for
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-939%3A+Support+Participation+in+2PC
> >  .
> >
> >  The KIP proposes extending Kafka transaction support (that already uses
> 2PC
> >  under the hood) to enable atomicity of dual writes to Kafka and an
> external
> >  database, and helps to fix a long standing Flink issue.
> >
> >  An example of code that uses the dual write recipe with JDBC and should
> >  work for most SQL databases is here
> >  https://github.com/apache/kafka/pull/14231.
> >
> >  The FLIP for the sister fix in Flink is here
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=255071710
> >
> >  -Artem
>
>
> Sent with Spark
>


Re: [DISCUSS] KIP-939: Support Participation in 2PC

2023-08-23 Thread Artem Livshits
Hi Roger,

Thank you for the feedback.  You make a very good point that we also
discussed internally.  Adding support for multiple concurrent
transactions in one producer could be valuable but it seems to be a fairly
large and independent change that would deserve a separate KIP.  If such
support is added we could modify 2PC functionality to incorporate that.

> Maybe not too bad but a bit of pain to manage these ids inside each
process and across all application processes.

I'm not sure if supporting multiple transactions in one producer would make
id management simpler: we'd need to store a piece of data per transaction,
so whether it's N producers with a single transaction or N transactions
with a single producer, it's still roughly the same amount of data to
manage.  In fact, managing transactional ids (current proposal) might be
easier, because the id is controlled by the application and it knows how to
complete the transaction after crash / restart; while a TID would be
generated by Kafka, which would create a question of starting a Kafka
transaction but not saving its TID, then crashing, then figuring out
which transactions to abort, etc.

> 2) creating a separate producer for each concurrency slot in the
application

This is a very valid concern.  Maybe we'd need to have some multiplexing of
transactional logical "streams" over the same connection.  Seems like a
separate KIP, though.

> Otherwise, it seems you're left with single-threaded model per
application process?

That's a fair assessment.  Not necessarily exactly single-threaded per
application, but a single producer per thread model (i.e. an application
could have a pool of threads + producers to increase concurrency).

-Artem

On Tue, Aug 22, 2023 at 7:22 PM Roger Hoover  wrote:

> Artem,
>
> Thanks for the reply.
>
> If I understand correctly, Kafka does not support concurrent transactions
> from the same producer (transactional id).  I think this means that
> applications that want to support in-process concurrency (say thread-level
> concurrency with row-level DB locking) would need to manage separate
> transactional ids and producers per thread and then store txn state
> accordingly.   The potential usability downsides I see are
> 1) managing a set of transactional ids for each application process that
> scales up to it's max concurrency.  Maybe not too bad but a bit of pain to
> manage these ids inside each process and across all application processes.
> 2) creating a separate producer for each concurrency slot in the
> application - this could create a lot more producers and resultant
> connections to Kafka than the typical model of a single producer per
> process.
>
> Otherwise, it seems you're left with single-threaded model per application
> process?
>
> Thanks,
>
> Roger
>
> On Tue, Aug 22, 2023 at 5:11 PM Artem Livshits
>  wrote:
>
> > Hi Roger, Arjun,
> >
> > Thank you for the questions.
> > > It looks like the application must have stable transactional ids over
> > time?
> >
> > The transactional id should uniquely identify a producer instance and
> needs
> > to be stable across the restarts.  If the transactional id is not stable
> > across restarts, then zombie messages from a previous incarnation of the
> > producer may violate atomicity.  If there are 2 producer instances
> > concurrently producing data with the same transactional id, they are
> going
> > to constantly fence each other and most likely make little or no
> progress.
> >
> > The name might be a little bit confusing as it may be mistaken for a
> > transaction id / TID that uniquely identifies every transaction.  The
> name
> > and the semantics were defined in the original exactly-once-semantics
> (EoS)
> > proposal (
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
> > )
> > and KIP-939 just builds on top of that.
> >
> > > I'm curious to understand what happens if the producer dies, and does
> not
> > come up and recover the pending transaction within the transaction
> timeout
> > interval.
> >
> > If the producer / application never comes back, the transaction will
> remain
> > in prepared (a.k.a. "in-doubt") state until an operator forcefully
> > terminates the transaction.  That's why a new ACL is defined in
> > this proposal -- this functionality should only be provided to applications
> > that implement proper recovery logic.
> >
> > -Artem
> >
> > On Tue, Aug 22, 2023 at 12:52 AM Arjun Satish 
> > wrote:
> >
> > > Hello Artem,
> > >
> > > Thanks for the KIP.
> > >
> &

Re: [DISCUSS] KIP-939: Support Participation in 2PC

2023-08-22 Thread Artem Livshits
Hi Roger, Arjun,

Thank you for the questions.
> It looks like the application must have stable transactional ids over
time?

The transactional id should uniquely identify a producer instance and needs
to be stable across the restarts.  If the transactional id is not stable
across restarts, then zombie messages from a previous incarnation of the
producer may violate atomicity.  If there are 2 producer instances
concurrently producing data with the same transactional id, they are going
to constantly fence each other and most likely make little or no progress.
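
For illustration, one common way to get such a stable id is to derive it from
an identity that survives restarts (e.g. a StatefulSet pod name) plus a
per-thread suffix; the names below are examples only:

    // Stable pod identity + per-thread index => the same transactional.id for the
    // same producer instance across restarts.
    String transactionalId = System.getenv("POD_NAME") + "-txn-" + threadIndex;
    props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, transactionalId);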

The name might be a little bit confusing as it may be mistaken for a
transaction id / TID that uniquely identifies every transaction.  The name
and the semantics were defined in the original exactly-once-semantics (EoS)
proposal (
https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging)
and KIP-939 just builds on top of that.

> I'm curious to understand what happens if the producer dies, and does not
come up and recover the pending transaction within the transaction timeout
interval.

If the producer / application never comes back, the transaction will remain
in prepared (a.k.a. "in-doubt") state until an operator forcefully
terminates the transaction.  That's why a new ACL is defined in
this proposal -- this functionality should only be provided to applications
that implement proper recovery logic.

-Artem

On Tue, Aug 22, 2023 at 12:52 AM Arjun Satish 
wrote:

> Hello Artem,
>
> Thanks for the KIP.
>
> I have the same question as Roger on concurrent writes, and an additional
> one on consumer behavior. Typically, transactions will timeout if not
> committed within some time interval. With the proposed changes in this KIP,
> consumers cannot consume past the ongoing transaction. I'm curious to
> understand what happens if the producer dies, and does not come up and
> recover the pending transaction within the transaction timeout interval. Or
> are we saying that when used in this 2PC context, we should configure these
> transaction timeouts to very large durations?
>
> Thanks in advance!
>
> Best,
> Arjun
>
>
> On Mon, Aug 21, 2023 at 1:06 PM Roger Hoover 
> wrote:
>
> > Hi Artem,
> >
> > Thanks for writing this KIP.  Can you clarify the requirements a bit more
> > for managing transaction state?  It looks like the application must have
> > stable transactional ids over time?   What is the granularity of those
> ids
> > and producers?  Say the application is a multi-threaded Java web server,
> > can/should all the concurrent threads share a transactional id and
> > producer?  That doesn't seem right to me unless the application is using
> > global DB locks that serialize all requests.  Instead, if the application
> > uses row-level DB locks, there could be multiple, concurrent, independent
> > txns happening in the same JVM so it seems like the granularity managing
> > transactional ids and txn state needs to line up with granularity of the
> DB
> > locking.
> >
> > Does that make sense or am I misunderstanding?
> >
> > Thanks,
> >
> > Roger
> >
> > On Wed, Aug 16, 2023 at 11:40 PM Artem Livshits
> >  wrote:
> >
> > > Hello,
> > >
> > > This is a discussion thread for
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-939%3A+Support+Participation+in+2PC
> > > .
> > >
> > > The KIP proposes extending Kafka transaction support (that already uses
> > 2PC
> > > under the hood) to enable atomicity of dual writes to Kafka and an
> > external
> > > database, and helps to fix a long standing Flink issue.
> > >
> > > An example of code that uses the dual write recipe with JDBC and should
> > > work for most SQL databases is here
> > > https://github.com/apache/kafka/pull/14231.
> > >
> > > The FLIP for the sister fix in Flink is here
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=255071710
> > >
> > > -Artem
> > >
> >
>


[DISCUSS] KIP-939: Support Participation in 2PC

2023-08-17 Thread Artem Livshits
Hello,

This is a discussion thread for
https://cwiki.apache.org/confluence/display/KAFKA/KIP-939%3A+Support+Participation+in+2PC
.

The KIP proposes extending Kafka transaction support (that already uses 2PC
under the hood) to enable atomicity of dual writes to Kafka and an external
database, and helps to fix a long standing Flink issue.

An example of code that uses the dual write recipe with JDBC and should
work for most SQL databases is here
https://github.com/apache/kafka/pull/14231.

The FLIP for the sister fix in Flink is here
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=255071710

-Artem


[jira] [Created] (KAFKA-15370) Support Participation in 2PC (KIP-939)

2023-08-16 Thread Artem Livshits (Jira)
Artem Livshits created KAFKA-15370:
--

 Summary: Support Participation in 2PC (KIP-939)
 Key: KAFKA-15370
 URL: https://issues.apache.org/jira/browse/KAFKA-15370
 Project: Kafka
  Issue Type: Improvement
Reporter: Artem Livshits


This ticket tracks the implementation of KIP-939: 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-939%3A+Support+Participation+in+2PC.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-903: Replicas with stale broker epoch should not be allowed to join the ISR

2023-02-07 Thread Artem Livshits
Hi Calvin,

Thank you for the KIP.  I have a similar question -- we need to support
rolling upgrades (when we have some old brokers and some new brokers), so
there could be combinations of old leader - new follower, new leader - old
follower, new leader - old controller, old leader - new controller.  Could
you elaborate on the behavior during rolls in the Compatibility section?

Also for compatibility it's probably going to be easier to just add a new
array of epochs in addition to the existing array of broker ids, instead of
removing one field and adding another.

The KIP mentions that we would explicitly do something special in ZK mode
in order to not implement new functionality.  I think it may be easier to
implement functionality for both ZK and KRraft mode than adding code to
disable it in ZK mode.

-Artem

On Tue, Feb 7, 2023 at 4:58 PM Jun Rao  wrote:

> Hi, Calvin,
>
> Thanks for the KIP. Looks good to me overall.
>
> Since this KIP changes the inter-broker protocol, should we bump up the
> metadata version (the equivalent of IBP) during upgrade?
>
> Jun
>
>
> On Fri, Feb 3, 2023 at 10:55 AM Calvin Liu 
> wrote:
>
> > Hi everyone,
> > I'd like to discuss the fix for the broker reboot data loss KAFKA-14139
> > .
> > It changes the Fetch and AlterPartition requests to include the broker
> > epochs. Then the controller can use these epochs to help reject the stale
> > AlterPartition request.
> > Please take a look. Thanks!
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-903%3A+Replicas+with+stale+broker+epoch+should+not+be+allowed+to+join+the+ISR
> >
>


Re: [VOTE] KIP-890: Transactions Server Side Defense

2023-02-02 Thread Artem Livshits
(non-binding) +1.  Looking forward to the implementation and fixing the
issues that we've got.

-Artem

On Mon, Jan 23, 2023 at 2:25 PM Guozhang Wang 
wrote:

> Thanks Justine, I have no further comments on the KIP. +1.
>
> On Tue, Jan 17, 2023 at 10:34 AM Jason Gustafson
>  wrote:
> >
> > +1. Thanks Justine!
> >
> > -Jason
> >
> > On Tue, Jan 10, 2023 at 3:46 PM Colt McNealy 
> wrote:
> >
> > > (non-binding) +1. Thank you for the KIP, Justine! I've read it; it
> makes
> > > sense to me and I am excited for the implementation.
> > >
> > > Colt McNealy
> > > *Founder, LittleHorse.io*
> > >
> > >
> > > On Tue, Jan 10, 2023 at 10:46 AM Justine Olshan
> > >  wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > I would like to start a vote on KIP-890 which aims to prevent some
> of the
> > > > common causes of hanging transactions and make other general
> improvements
> > > > to transactions in Kafka.
> > > >
> > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense
> > > >
> > > > Please take a look if you haven't already and vote!
> > > >
> > > > Justine
> > > >
> > >
>


Re: [DISCUSS] KIP-890 Server Side Defense

2023-01-20 Thread Artem Livshits
>  looks like we already have code to handle bumping the epoch and
when the epoch is Short.MAX_VALUE, we get a new producer ID.

My understanding is that this logic is currently encapsulated in the broker
and the client doesn't really know at which epoch value the new producer id
is generated.  With the new protocol, the client would need to be aware.
We don't need to change the logic, just document it.  With our
implementation, once epoch reaches Short.MAX_VALUE it cannot be used
further, but a naïve client implementer may miss this point and it may be
missed in testing if the tests don't overflow the epoch, and then once they
hit the issue, it's not immediately obvious from the KIP how to handle it.
Explicitly documenting this point in the KIP would help to avoid (or
quickly resolve) such issues.

-Artem

On Wed, Jan 18, 2023 at 3:01 PM Justine Olshan 
wrote:

> Yeah -- looks like we already have code to handle bumping the epoch and
> when the epoch is Short.MAX_VALUE, we get a new producer ID. Since this is
> already the behavior, do we want to change it further?
>
> Justine
>
> On Wed, Jan 18, 2023 at 1:12 PM Justine Olshan 
> wrote:
>
> > Hey all, just wanted to quickly update and say I've modified the KIP to
> > explicitly mention that AddOffsetCommitsToTxnRequest will be replaced by
> > a coordinator-side (inter-broker) AddPartitionsToTxn implicit request.
> This
> > mirrors the user partitions and will implicitly add offset partitions to
> > transactions when we commit offsets on them. We will deprecate
> AddOffsetCommitsToTxnRequest
> > for new clients.
> >
> > Also to address Artem's comments --
> > I'm a bit unsure if the changes here will change the previous behavior
> for
> > fencing producers. In the case you mention in the first paragraph, are
> you
> > saying we bump the epoch before we try to abort the transaction? I think
> I
> > need to understand the scenarios you mention a bit better.
> >
> > As for the second part -- I think it makes sense to have some sort of
> > "sentinel" epoch to signal epoch is about to overflow (I think we sort of
> > have this value in place in some ways) so we can codify it in the KIP.
> I'll
> > look into that and try to update soon.
> >
> > Thanks,
> > Justine.
> >
> > On Fri, Jan 13, 2023 at 5:01 PM Artem Livshits
> >  wrote:
> >
> >> It's good to know that KIP-588 addressed some of the issues.  Looking at
> >> the code, it still looks like there are some cases that would result in
> >> fatal error, e.g. PRODUCER_FENCED is issued by the transaction
> coordinator
> >> if epoch doesn't match, and the client treats it as a fatal error (code
> in
> >> TransactionManager request handling).  If we consider, for example,
> >> committing a transaction that returns a timeout, but actually succeeds,
> >> trying to abort it or re-commit may result in PRODUCER_FENCED error
> >> (because of epoch bump).
> >>
> >> For failed commits, specifically, we need to know the actual outcome,
> >> because if we return an error the application may think that the
> >> transaction is aborted and redo the work, leading to duplicates.
> >>
> >> Re: overflowing epoch.  We could either do it on the TC and return both
> >> producer id and epoch (e.g. change the protocol), or signal the client
> >> that
> >> it needs to get a new producer id.  Checking for max epoch could be a
> >> reasonable signal, the value to check should probably be present in the
> >> KIP
> >> as this is effectively a part of the contract.  Also, the TC should
> >> probably return an error if the client didn't change producer id after
> >> hitting max epoch.
> >>
> >> -Artem
> >>
> >>
> >> On Thu, Jan 12, 2023 at 10:31 AM Justine Olshan
> >>  wrote:
> >>
> >> > Thanks for the discussion Artem.
> >> >
> >> > With respect to the handling of fenced producers, we have some
> behavior
> >> > already in place. As of KIP-588:
> >> >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-588%3A+Allow+producers+to+recover+gracefully+from+transaction+timeouts
> >> > ,
> >> > we handle timeouts more gracefully. The producer can recover.
> >> >
> >> > Produce requests can also recover from epoch fencing by aborting the
> >> > transaction and starting over.
> >> >
> >> > What other cases were you considering that would cause us to have a
> >> fenced
> >>

Re: [DISCUSS] KIP-890 Server Side Defense

2023-01-13 Thread Artem Livshits
It's good to know that KIP-588 addressed some of the issues.  Looking at
the code, it still looks like there are some cases that would result in
fatal error, e.g. PRODUCER_FENCED is issued by the transaction coordinator
if epoch doesn't match, and the client treats it as a fatal error (code in
TransactionManager request handling).  If we consider, for example,
committing a transaction that returns a timeout, but actually succeeds,
trying to abort it or re-commit may result in PRODUCER_FENCED error
(because of epoch bump).

For failed commits, specifically, we need to know the actual outcome,
because if we return an error the application may think that the
transaction is aborted and redo the work, leading to duplicates.

Re: overflowing epoch.  We could either do it on the TC and return both
producer id and epoch (e.g. change the protocol), or signal the client that
it needs to get a new producer id.  Checking for max epoch could be a
reasonable signal, the value to check should probably be present in the KIP
as this is effectively a part of the contract.  Also, the TC should
probably return an error if the client didn't change producer id after
hitting max epoch.

-Artem


On Thu, Jan 12, 2023 at 10:31 AM Justine Olshan
 wrote:

> Thanks for the discussion Artem.
>
> With respect to the handling of fenced producers, we have some behavior
> already in place. As of KIP-588:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-588%3A+Allow+producers+to+recover+gracefully+from+transaction+timeouts
> ,
> we handle timeouts more gracefully. The producer can recover.
>
> Produce requests can also recover from epoch fencing by aborting the
> transaction and starting over.
>
> What other cases were you considering that would cause us to have a fenced
> epoch but we'd want to recover?
>
> The first point about handling epoch overflows is fair. I think there is
> some logic we'd need to consider. (ie, if we are one away from the max
> epoch, we need to reset the producer ID.) I'm still wondering if there is a
> way to direct this from the response, or if everything should be done on
> the client side. Let me know if you have any thoughts here.
>
> Thanks,
> Justine
>
> On Tue, Jan 10, 2023 at 4:06 PM Artem Livshits
>  wrote:
>
> > There are some workflows in the client that are implied by protocol
> > changes, e.g.:
> >
> > - for new clients, epoch changes with every transaction and can overflow,
> > in old clients this condition was handled transparently, because epoch
> was
> > bumped in InitProducerId and it would return a new producer id if epoch
> > overflows, the new clients would need to implement some workflow to
> refresh
> > producer id
> > - how to handle fenced producers, for new clients epoch changes with
> every
> > transaction, so in presence of failures during commits / aborts, the
> > producer could get easily fenced, old clients would pretty much get
> > fenced when a new incarnation of the producer was initialized with
> > InitProducerId so it's ok to treat as a fatal error, the new clients
> would
> > need to implement some workflow to handle that error, otherwise they
> could
> > get fenced by themselves
> > - in particular (as a subset of the previous issue), what would the
> client
> > do if it got a timeout during commit?  commit could've succeeded or
> failed
> >
> > Not sure if this has to be defined in the KIP as implementing those
> > probably wouldn't require protocol changes, but we have multiple
> > implementations of Kafka clients, so probably would be good to have some
> > client implementation guidance.  Could also be done as a separate doc.
> >
> > -Artem
> >
> > On Mon, Jan 9, 2023 at 3:38 PM Justine Olshan
>  > >
> > wrote:
> >
> > > Hey all, I've updated the KIP to incorporate Jason's suggestions.
> > >
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense
> > >
> > >
> > > 1. Use AddPartitionsToTxn + verify flag to check on old clients
> > > 2. Updated AddPartitionsToTxn API to support transaction batching
> > > 3. Mention IBP bump
> > > 4. Mention auth change on new AddPartitionsToTxn version.
> > >
> > > I'm planning on opening a vote soon.
> > > Thanks,
> > > Justine
> > >
> > > On Fri, Jan 6, 2023 at 3:32 PM Justine Olshan 
> > > wrote:
> > >
> > > > Thanks Jason. Those changes make sense to me. I will update the KIP.
> > > >
> > > >
> > > >
> > > > On Fri, Jan 6, 2023 at 3:31 PM Jason Gustafson

Re: [DISCUSS] KIP-890 Server Side Defense

2023-01-10 Thread Artem Livshits
> > documentation purpose?
> >> > >> > > > > > > > >
> >> > >> > > > > > > > > 40: I think you hit a fair point about race
> >> conditions
> >> > or
> >> > >> > > client
> >> > >> > > > > bugs
> >> > >> > > > > > > > > (incorrectly not bumping the epoch). The
> >> > >> complexity/confusion
> >> > >> > > for
> >> > >> > > > > > using
> >> > >> > > > > > > > > the bumped epoch I see, is mainly for internal
> >> > debugging,
> >> > >> ie,
> >> > >> > > > > > > inspecting
> >> > >> > > > > > > > > log segment dumps -- it seems harder to reason
> about
> >> the
> >> > >> > system
> >> > >> > > > for
> >> > >> > > > > > us
> >> > >> > > > > > > > > humans. But if we get better guarantees, it would
> be
> >> > >> worth to
> >> > >> > > use
> >> > >> > > > > the
> >> > >> > > > > > > > > bumped epoch.
> >> > >> > > > > > > > >
> >> > >> > > > > > > > > 60: as I mentioned already, I don't know the broker
> >> > >> internals
> >> > >> > > to
> >> > >> > > > > > > provide
> >> > >> > > > > > > > > more input. So if nobody else chimes in, we should
> >> just
> >> > >> move
> >> > >> > > > > forward
> >> > >> > > > > > > > > with your proposal.
> >> > >> > > > > > > > >
> >> > >> > > > > > > > >
> >> > >> > > > > > > > > -Matthias
> >> > >> > > > > > > > >
> >> > >> > > > > > > > >
> >> > >> > > > > > > > > On 12/6/22 4:22 PM, Justine Olshan wrote:
> >> > >> > > > > > > > > > Hi all,
> >> > >> > > > > > > > > > After Artem's questions about error behavior,
> I've
> >> > >> > > re-evaluated
> >> > >> > > > > the
> >> > >> > > > > > > > > > unknown producer ID exception and had some
> >> discussions
> >> > >> > > offline.
> >> > >> > > > > > > > > >
> >> > >> > > > > > > > > > I think generally it makes sense to simplify
> error
> >> > >> handling
> >> > >> > > in
> >> > >> > > > > > cases
> >> > >> > > > > > > > like
> >> > >> > > > > > > > > > this and the UNKNOWN_PRODUCER_ID error has a
> pretty
> >> > long
> >> > >> > and
> >> > >> > > > > > > > complicated
> >> > >> > > > > > > > > > history. Because of this, I propose adding a new
> >> error
> >> > >> code
> >> > >> > > > > > > > > ABORTABLE_ERROR
> >> > >> > > > > > > > > > that when encountered by new clients (gated by
> the
> >> > >> produce
> >> > >> > > > > request
> >> > >> > > > > > > > > version)
> >> > >> > > > > > > > > > will simply abort the transaction. This allows
> the
> >> > >> server
> >> > >> > to
> >> > >> > > > have
> >> > >> > > > > > > some
> >> > >> > > > > > > > > say
> >> > >> > > > > > > > > > in whether the client aborts and makes handling
> >> much
> >> > >> > simpler.
> >> > >> > > > In
> >> > >> > > > > > the
> >> > >> > > > > > > >

Re: [DISCUSS] KIP-890 Server Side Defense

2022-11-30 Thread Artem Livshits
Jeff's second point:
> *does the typical produce request path append records to local log along*
>
> *with the currentTxnFirstOffset information? I would like to understand*
>
> *when the field is written to disk.*
>
>
> Yes, the first produce request populates this field and writes the offset
> as part of the record batch and also to the producer state snapshot. When
> we reload the records on restart and/or reassignment, we repopulate this
> field with the snapshot from disk along with the rest of the producer
> state.
>
> Let me know if there are further comments and/or questions.
>
> Thanks,
> Justine
>
> On Tue, Nov 22, 2022 at 9:00 PM Jeff Kim 
> wrote:
>
> > Hi Justine,
> >
> > Thanks for the KIP! I have two questions:
> >
> > 1) For new clients, we can once again return an error UNKNOWN_PRODUCER_ID
> > for sequences
> > that are non-zero when there is no producer state present on the server.
> > This will indicate we missed the 0 sequence and we don't yet want to
> write
> > to the log.
> >
> > I would like to understand the current behavior to handle older clients,
> > and if there are any changes we are making. Maybe I'm missing something,
> > but we would want to identify whether we missed the 0 sequence for older
> > clients, no?
> >
> > 2) Upon returning from the transaction coordinator, we can set the
> > transaction
> > as ongoing on the leader by populating currentTxnFirstOffset
> > through the typical produce request handling.
> >
> > does the typical produce request path append records to local log along
> > with the currentTxnFirstOffset information? I would like to understand
> > when the field is written to disk.
> >
> > Thanks,
> > Jeff
> >
> >
> > On Tue, Nov 22, 2022 at 4:44 PM Artem Livshits
> >  wrote:
> >
> > > Hi Justine,
> > >
> > > Thank you for the KIP.  I have one question.
> > >
> > > 5) For new clients, we can once again return an error
> UNKNOWN_PRODUCER_ID
> > >
> > > I believe we had problems in the past with returning
> UNKNOWN_PRODUCER_ID
> > > because it was considered fatal and required client restart.  It would
> be
> > > good to spell out the new client behavior when it receives the error.
> > >
> > > -Artem
> > >
> > > On Tue, Nov 22, 2022 at 10:00 AM Justine Olshan
> > >  wrote:
> > >
> > > > Thanks for taking a look Matthias. I've tried to answer your
> questions
> > > > below:
> > > >
> > > > 10)
> > > >
> > > > Right — so the hanging transaction only occurs when we have a late
> > > message
> > > > come in and the partition is never added to a transaction again. If
> we
> > > > never add the partition to a transaction, we will never write a
> marker
> > > and
> > > > never advance the LSO.
> > > >
> > > > If we do end up adding the partition to the transaction (I suppose
> this
> > > can
> > > > happen before or after the late message comes in) then we will
> include
> > > the
> > > > late message in the next (incorrect) transaction.
> > > >
> > > > So perhaps it is clearer to make the distinction between messages
> that
> > > > eventually get added to the transaction (but the wrong one) or
> messages
> > > > that never get added and become hanging.
> > > >
> > > >
> > > > 20)
> > > >
> > > > The client side change for 2 is removing the addPartitions to
> > transaction
> > > > call. We don't need to make this from the producer to the txn
> > > coordinator,
> > > > only server side.
> > > >
> > > >
> > > > In my opinion, the issue with the addPartitionsToTxn call for older
> > > clients
> > > > is that we don't have the epoch bump, so we don't know if the message
> > > > belongs to the previous transaction or this one. We need to check if
> > the
> > > > partition has been added to this transaction. Of course, this means
> we
> > > > won't completely cover the case where we have a really late message
> and
> > > we
> > > > have added the partition to the new transaction, but that's
> > unfortunately
> > > > something we will need the new clients to cover.
> > > >
> > > >
> > > > 30)
> > > >
> > > > Transaction is ongoing = partit

Re: [DISCUSS] KIP-881: Rack-aware Partition Assignment for Kafka Consumers

2022-11-30 Thread Artem Livshits
I think it's reasonable for practical scenarios.  Is it possible to turn
off rack awareness in the clients in case the broker selector plugin is not
compatible with rack-aware logic in the client?

-Artem

On Wed, Nov 30, 2022 at 2:46 AM Rajini Sivaram 
wrote:

> Hi Artem,
>
> Understood your concern - brokers could use a replica selector that uses
> some other client metadata other than rack id to decide the preferred read
> replica, or just use the leader. In that case, consumers wouldn't really
> benefit from aligning partition assignment on rack ids. So the question is
> whether we should make the default consumer assignors use rack ids for
> partition assignment or whether we should have different rack-aware
> assignors that can be configured when brokers use rack-aware replica
> selector. We had a similar discussion earlier in the thread (the KIP had
> originally proposed new rack-aware assignors). We decided to update the
> default assignors because:
> 1) Consumers using fetch-from-follower automatically benefit from improved
> locality, without having to update another config.
> 2) Consumers configure rack id for improved locality, so aligning on
> replica rack ids generally makes sense.
> 3) We prioritize balanced assignment over locality in the consumer, so the
> default assignors should work effectively regardless of broker's replica
> selector.
>
> Does that make sense?
>
>
> Thank you,
>
> Rajini
>
>
>
> On Tue, Nov 29, 2022 at 1:05 AM Artem Livshits
>  wrote:
>
> > I'm thinking of a case where the broker uses a plugin that does not use
> > rack ids to determine client locality, in that case the consumer might
> > arrive at the wrong decision based on rack ids.
> >
> > -Artem
> >
> > On Wed, Nov 23, 2022 at 3:54 AM Rajini Sivaram 
> > wrote:
> >
> > > Hi Artem,
> > >
> > > Thanks for reviewing the KIP. The client doesn't rely on a particular
> > > replica assignment logic in the broker. Instead, it matches the actual
> > rack
> > > assignment for partitions from the metadata with consumer racks. So
> there
> > > is no assumption in the client implementation about the use of
> > > RackAwareReplicaSelector in the broker. Did I misunderstand your
> concern?
> > >
> > > Regards,
> > >
> > > Rajini
> > >
> > >
> > > On Tue, Nov 22, 2022 at 11:03 PM Artem Livshits
> > >  wrote:
> > >
> > > > Hi Rajini,
> > > >
> > > > Thank you for the KIP, the KIP looks good to match
> > > RackAwareReplicaSelector
> > > > behavior that is available out-of-box.  Which should probably be good
> > > > enough in practice.
> > > >
> > > > From the design perspective, though, RackAwareReplicaSelector is just
> > one
> > > > possible plugin, in theory the broker could use a plugin that
> leverages
> > > > networking information to get client locality or some other way, so
> it
> > > > seems like we're making an assumption about broker replica selection
> in
> > > the
> > > > default assignment implementation.  So I wonder if we should use a
> > > separate
> > > > plugin that would be set when RackAwareReplicaSelector is set, rather
> > > than
> > > > assume broker behavior in the client implementation.
> > > >
> > > > -Artem
> > > >
> > > > On Wed, Nov 16, 2022 at 8:08 AM Jun Rao 
> > > wrote:
> > > >
> > > > > Hi, David and Rajini,
> > > > >
> > > > > Thanks for the explanation. It makes sense to me now.
> > > > >
> > > > > Jun
> > > > >
> > > > > On Wed, Nov 16, 2022 at 1:44 AM Rajini Sivaram <
> > > rajinisiva...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Thanks David, that was my understanding as well.
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Rajini
> > > > > >
> > > > > > On Wed, Nov 16, 2022 at 8:08 AM David Jacot
> > > >  > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Jun,
> > > > > > >
> > > > > > > We don't need to bump any RPC requests. The subscription is
> > > > serialized
> > > > > > > (including its version) and included as bytes in the RPCs.
> > > > > >

Re: [DISCUSS] KIP-881: Rack-aware Partition Assignment for Kafka Consumers

2022-11-28 Thread Artem Livshits
I'm thinking of a case where the broker uses a plugin that does not use
rack ids to determine client locality, in that case the consumer might
arrive at the wrong decision based on rack ids.

-Artem

On Wed, Nov 23, 2022 at 3:54 AM Rajini Sivaram 
wrote:

> Hi Artem,
>
> Thanks for reviewing the KIP. The client doesn't rely on a particular
> replica assignment logic in the broker. Instead, it matches the actual rack
> assignment for partitions from the metadata with consumer racks. So there
> is no assumption in the client implementation about the use of
> RackAwareReplicaSelector in the broker. Did I misunderstand your concern?
>
> Regards,
>
> Rajini
>
>
> On Tue, Nov 22, 2022 at 11:03 PM Artem Livshits
>  wrote:
>
> > Hi Rajini,
> >
> > Thank you for the KIP, the KIP looks good to match
> RackAwareReplicaSelector
> > behavior that is available out-of-box.  Which should probably be good
> > enough in practice.
> >
> > From the design perspective, though, RackAwareReplicaSelector is just one
> > possible plugin, in theory the broker could use a plugin that leverages
> > networking information to get client locality or some other way, so it
> > seems like we're making an assumption about broker replica selection in
> the
> > default assignment implementation.  So I wonder if we should use a
> separate
> > plugin that would be set when RackAwareReplicaSelector is set, rather
> than
> > assume broker behavior in the client implementation.
> >
> > -Artem
> >
> > On Wed, Nov 16, 2022 at 8:08 AM Jun Rao 
> wrote:
> >
> > > Hi, David and Rajini,
> > >
> > > Thanks for the explanation. It makes sense to me now.
> > >
> > > Jun
> > >
> > > On Wed, Nov 16, 2022 at 1:44 AM Rajini Sivaram <
> rajinisiva...@gmail.com>
> > > wrote:
> > >
> > > > Thanks David, that was my understanding as well.
> > > >
> > > > Regards,
> > > >
> > > > Rajini
> > > >
> > > > On Wed, Nov 16, 2022 at 8:08 AM David Jacot
> >  > > >
> > > > wrote:
> > > >
> > > > > Hi Jun,
> > > > >
> > > > > We don't need to bump any RPC requests. The subscription is
> > serialized
> > > > > (including its version) and included as bytes in the RPCs.
> > > > >
> > > > > Best,
> > > > > David
> > > > >
> > > > > On Tue, Nov 15, 2022 at 11:42 PM Jun Rao  >
> > > > wrote:
> > > > > >
> > > > > > Hi, Rajini,
> > > > > >
> > > > > > Thanks for the updated KIP. Just another minor comment. It would
> be
> > > > > useful
> > > > > > to list all RPC requests whose version needs to be bumped because
> > of
> > > > the
> > > > > > changes in ConsumerProtocolSubscription.
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Tue, Nov 15, 2022 at 3:45 AM Rajini Sivaram <
> > > > rajinisiva...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi David,
> > > > > > >
> > > > > > > Sorry, I was out of office and hence the delay in responding.
> > > Thanks
> > > > > for
> > > > > > > reviewing the KIP and answering Viktor's question (thanks for
> the
> > > > > review,
> > > > > > > Viktor).
> > > > > > >
> > > > > > > Responses below:
> > > > > > > 01)  I was in two minds about adding new assignors, because as
> > you
> > > > > said,
> > > > > > > user experience is better if assignors used racks when
> available.
> > > But
> > > > > I was
> > > > > > > a bit concerned about changing the algorithm in existing
> > > applications
> > > > > which
> > > > > > > were already configuring `client.rack`. It felt less risky to
> add
> > > new
> > > > > > > assignor implementations instead. But we can retain existing
> > logic
> > > if
> > > > > a)
> > > > > > > rack information is not available and b) racks have all
> > partitions.
> > > > So
> > > > > the
> > > > > > > only case where

Re: [DISCUSS] KIP-890 Server Side Defense

2022-11-22 Thread Artem Livshits
Hi Justine,

Thank you for the KIP.  I have one question.

5) For new clients, we can once again return an error UNKNOWN_PRODUCER_ID

I believe we had problems in the past with returning UNKNOWN_PRODUCER_ID
because it was considered fatal and required client restart.  It would be
good to spell out the new client behavior when it receives the error.
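
To illustrate the kind of decision that needs spelling out (a sketch only,
not a proposal for the actual handling logic):

    // Sketch: does the error end the producer (fatal) or only the current
    // transaction?  Old behavior treated UNKNOWN_PRODUCER_ID as fatal; the
    // behavior for new clients is what the KIP should define explicitly.
    static boolean isFatalForProducer(org.apache.kafka.common.protocol.Errors error) {
        return error == org.apache.kafka.common.protocol.Errors.UNKNOWN_PRODUCER_ID;
    }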

-Artem

On Tue, Nov 22, 2022 at 10:00 AM Justine Olshan
 wrote:

> Thanks for taking a look Matthias. I've tried to answer your questions
> below:
>
> 10)
>
> Right — so the hanging transaction only occurs when we have a late message
> come in and the partition is never added to a transaction again. If we
> never add the partition to a transaction, we will never write a marker and
> never advance the LSO.
>
> If we do end up adding the partition to the transaction (I suppose this can
> happen before or after the late message comes in) then we will include the
> late message in the next (incorrect) transaction.
>
> So perhaps it is clearer to make the distinction between messages that
> eventually get added to the transaction (but the wrong one) or messages
> that never get added and become hanging.
>
>
> 20)
>
> The client side change for 2 is removing the addPartitions to transaction
> call. We don't need to make this from the producer to the txn coordinator,
> only server side.
>
>
> In my opinion, the issue with the addPartitionsToTxn call for older clients
> is that we don't have the epoch bump, so we don't know if the message
> belongs to the previous transaction or this one. We need to check if the
> partition has been added to this transaction. Of course, this means we
> won't completely cover the case where we have a really late message and we
> have added the partition to the new transaction, but that's unfortunately
> something we will need the new clients to cover.
>
>
> 30)
>
> Transaction is ongoing = partition was added to transaction via
> addPartitionsToTxn. We check this with the DescribeTransactions call. Let
> me know if this wasn't sufficiently explained here:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense#KIP890:TransactionsServerSideDefense-EnsureOngoingTransactionforOlderClients(3)
>
>
> 40)
>
> The idea here is that if any messages somehow come in before we get the new
> epoch to the producer, they will be fenced. However, if we don't think this
> is necessary, it can be discussed
>
>
> 50)
>
> It should be synchronous because if we have an event (ie, an error) that
> causes us to need to abort the transaction, we need to know which
> partitions to send transaction markers to. We know the partitions because
> we added them to the coordinator via the addPartitionsToTxn call.
> Previously we have had asynchronous calls in the past (ie, writing the
> commit markers when the transaction is completed) but often this just
> causes confusion as we need to wait for some operations to complete. In the
> writing commit markers case, clients often see CONCURRENT_TRANSACTIONs
> error messages and that can be confusing. For that reason, it may be
> simpler to just have synchronous calls — especially if we need to block on
> some operation's completion anyway before we can start the next
> transaction. And yes, I meant coordinator. I will fix that.
>
>
> 60)
>
> When we are checking if the transaction is ongoing, we need to make a round
> trip from the leader partition to the transaction coordinator. In the time
> we are waiting for this message to come back, in theory we could have sent
> a commit/abort call that would make the original result of the check out of
> date. That is why we can check the leader state before we write to the log.
>
>
> I'm happy to update the KIP if some of these things were not clear.
> Thanks,
> Justine
>
> On Mon, Nov 21, 2022 at 7:11 PM Matthias J. Sax  wrote:
>
> > Thanks for the KIP.
> >
> > Couple of clarification questions (I am not a broker expert do maybe
> > some question are obvious for others, but not for me with my lack of
> > broker knowledge).
> >
> >
> >
> > (10)
> >
> > > The delayed message case can also violate EOS if the delayed message
> > comes in after the next addPartitionsToTxn request comes in. Effectively
> we
> > may see a message from a previous (aborted) transaction become part of
> the
> > next transaction.
> >
> > What happens if the message come in before the next addPartitionsToTxn
> > request? It seems the broker hosting the data partitions won't know
> > anything about it and append it to the partition, too? What is the
> > difference between both cases?
> >
> > Also, it seems a TX would only hang, if there is no following TX that is
> > either committer or aborted? Thus, for the case above, the TX might
> > actually not hang (of course, we might get an EOS violation if the first
> > TX was aborted and the second committed, or the other way around).
> >
> >
> > (20)
> >
> > > Of course, 1 and 2 require client-side changes, so for older 

Re: [DISCUSS] KIP-881: Rack-aware Partition Assignment for Kafka Consumers

2022-11-22 Thread Artem Livshits
Hi Rajini,

Thank you for the KIP, the KIP looks good to match RackAwareReplicaSelector
behavior that is available out-of-box.  Which should probably be good
enough in practice.

From the design perspective, though, RackAwareReplicaSelector is just one
possible plugin, in theory the broker could use a plugin that leverages
networking information to get client locality or some other way, so it
seems like we're making an assumption about broker replica selection in the
default assignment implementation.  So I wonder if we should use a separate
plugin that would be set when RackAwareReplicaSelector is set, rather than
assume broker behavior in the client implementation.
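
For reference, the rack matching the KIP's assignors would do is conceptually
along these lines (a sketch with illustrative names, not the actual assignor
code):

    // Sketch: a consumer is considered "local" to a partition if its
    // client.rack matches one of the racks hosting the partition's replicas.
    static boolean consumerIsLocal(java.util.Set<String> replicaRacks, String consumerRack) {
        return consumerRack != null && replicaRacks.contains(consumerRack);
    }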

-Artem

On Wed, Nov 16, 2022 at 8:08 AM Jun Rao  wrote:

> Hi, David and Rajini,
>
> Thanks for the explanation. It makes sense to me now.
>
> Jun
>
> On Wed, Nov 16, 2022 at 1:44 AM Rajini Sivaram 
> wrote:
>
> > Thanks David, that was my understanding as well.
> >
> > Regards,
> >
> > Rajini
> >
> > On Wed, Nov 16, 2022 at 8:08 AM David Jacot  >
> > wrote:
> >
> > > Hi Jun,
> > >
> > > We don't need to bump any RPC requests. The subscription is serialized
> > > (including its version) and included as bytes in the RPCs.
> > >
> > > Best,
> > > David
> > >
> > > On Tue, Nov 15, 2022 at 11:42 PM Jun Rao 
> > wrote:
> > > >
> > > > Hi, Rajini,
> > > >
> > > > Thanks for the updated KIP. Just another minor comment. It would be
> > > useful
> > > > to list all RPC requests whose version needs to be bumped because of
> > the
> > > > changes in ConsumerProtocolSubscription.
> > > >
> > > > Jun
> > > >
> > > > On Tue, Nov 15, 2022 at 3:45 AM Rajini Sivaram <
> > rajinisiva...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi David,
> > > > >
> > > > > Sorry, I was out of office and hence the delay in responding.
> Thanks
> > > for
> > > > > reviewing the KIP and answering Viktor's question (thanks for the
> > > review,
> > > > > Viktor).
> > > > >
> > > > > Responses below:
> > > > > 01)  I was in two minds about adding new assignors, because as you
> > > said,
> > > > > user experience is better if assignors used racks when available.
> But
> > > I was
> > > > > a bit concerned about changing the algorithm in existing
> applications
> > > which
> > > > > were already configuring `client.rack`. It felt less risky to add
> new
> > > > > assignor implementations instead. But we can retain existing logic
> if
> > > a)
> > > > > rack information is not available and b) racks have all partitions.
> > So
> > > the
> > > > > only case where logic will be different is when rack information is
> > > > > available because consumers chose to use `client.rack` to benefit
> > from
> > > > > improved locality, but racks only have a subset of partitions. It
> > seems
> > > > > reasonable to make existing assignors rack-aware in this case to
> > > improve
> > > > > locality. I have updated the KIP. Will wait and see if there are
> any
> > > > > objections to this change.
> > > > >
> > > > > 02) Updated 1), so existing assignor classes will be used.
> > > > >
> > > > > 03) Updated the KIP to use version 3, thanks.
> > > > >
> > > > > If there are no concerns or further comments, I will start voting
> > later
> > > > > today.
> > > > >
> > > > > Thank you,
> > > > >
> > > > > Rajini
> > > > >
> > > > >
> > > > > On Fri, Nov 4, 2022 at 9:58 AM David Jacot
> >  > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Viktor,
> > > > > >
> > > > > > I can actually answer your question. KIP-848 already includes
> rack
> > > > > > awareness in the protocol. It is actually the other way around,
> > this
> > > > > > KIP takes the idea from KIP-848 to implement it in the current
> > > > > > protocol in order to realize the benefits sooner. The new
> protocol
> > > > > > will take a while to be implemented.
> > > > > >
> > > > > > Best,
> > > > > > David
> > > > > >
> > > > > > On Fri, Nov 4, 2022 at 10:55 AM David Jacot  >
> > > wrote:
> > > > > > >
> > > > > > > Hi Rajini,
> > > > > > >
> > > > > > > Thanks for the KIP. I have a few questions/comments:
> > > > > > >
> > > > > > > 01. If I understood correctly, the plan is to add new assignors
> > > which
> > > > > > > are rack aware. Is this right? I wonder if it is a judicious
> > choice
> > > > > > > here. The main drawback is that clients must be configured
> > > correctly
> > > > > > > in order to get the benefits. From a user experience
> perspective,
> > > it
> > > > > > > would be much better if we would only require our users to set
> > > > > > > client.rack. However, I understand the argument of keeping the
> > > > > > > existing assignors as-is in order to limit the risk but it also
> > > means
> > > > > > > that we will have to maintain multiple assignors with a
> somewhat
> > > > > > > similar core logic (within a rack). What do you think?
> > > > > > >
> > > > > > > 02. If we proceed with new rack-aware assignors, we should
> > mention
> > > > > > > their fully qualified names in the KIP as they will become part
> > of
> 

Re: [DISCUSS] Apache Kafka 3.3.0 Release

2022-08-30 Thread Artem Livshits
Hi  José ,

I found a potential regression in the new Sticky Partitioner logic, details
are here https://issues.apache.org/jira/browse/KAFKA-14156.  I've added a
draft PR and will add unit tests soon.  I think we should include the fix in
3.3.0.

-Artem

On Mon, Aug 29, 2022 at 1:17 PM Colin McCabe  wrote:

> Hi José,
>
> Thanks for creating the first RC.
>
> I found an issue where kafka-feature.sh needs some work for KRaft. So, it
> looks like we will have to sink this RC. I opened a blocker JIRA,
> KAFKA-14187, and attached a PR.
>
> This should not block testing of other parts of the release, so hopefully
> we will still get some good testing today.
>
> best,
> Colin
>
>
> On Mon, Aug 29, 2022, at 10:12, José Armando García Sancio wrote:
> > The documentation and protocol links are not working. Looking into it.
> >
> > https://kafka.apache.org/33/documentation.html
> > https://kafka.apache.org/33/protocol.html
> >
> > Thanks,
> > -José
>


[jira] [Created] (KAFKA-14156) Built-in partitioner may create suboptimal batches with large linger.ms

2022-08-10 Thread Artem Livshits (Jira)
Artem Livshits created KAFKA-14156:
--

 Summary: Built-in partitioner may create suboptimal batches with 
large linger.ms
 Key: KAFKA-14156
 URL: https://issues.apache.org/jira/browse/KAFKA-14156
 Project: Kafka
  Issue Type: Improvement
  Components: producer 
Affects Versions: 3.3.0
Reporter: Artem Livshits


The new built-in "sticky" partitioner switches partitions based on the amount 
of bytes produced to a partition.  It doesn't use batch creation as a switch 
trigger.  The previous "sticky" DefaultPartitioner switched partition when a 
new batch was created and with small linger.ms (default is 0) could result in 
sending larger batches to slower brokers potentially overloading them.  See 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
 for more detail.

However, with a large linger.ms, the new built-in partitioner may create 
suboptimal batches.  Let's consider an example: suppose linger.ms=500, 
batch.size=16KB (default) and we produce 24KB / sec, i.e. every 500ms we 
produce 12KB worth of data.  The new built-in partitioner would switch 
partitions every 16KB, so we could get into the following batching pattern:
 * produce 12KB to one partition in 500ms, hit linger, send 12KB batch
 * produce 4KB more to the same partition, now we've produced 16KB of data, 
switch partition
 * produce 12KB to the second partition in 500ms, hit linger, send 12KB batch
 * in the meantime, the 4KB produced to the first partition would hit linger as 
well, sending a 4KB batch
 * produce 4KB more to the second partition, now we've produced 16KB of data to 
the second partition, switch to 3rd partition

so in this scenario the new built-in partitioner produces a mix of 12KB and 4KB 
batches, while the previous DefaultPartitioner would produce only 12KB batches 
-- it switches on new batch creation, so there are no "mid-linger" leftover 
batches.

To avoid batch fragmentation on a partition switch, we can wait until 
the batch is ready before switching the partition, i.e. the condition to switch 
to a new partition would be "produced batch.size bytes" AND "batch is not 
lingering".  This may potentially introduce some non-uniformity into data 
distribution, but unlike the previous DefaultPartitioner, the non-uniformity 
would not be based on broker performance and won't re-introduce the bad pattern 
of sending more data to slower brokers.
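
A minimal sketch of the proposed switch condition (illustrative names, not the 
actual RecordAccumulator fields):

{code:java}
// Switch partitions only after producing batch.size bytes AND once the current
// batch is no longer waiting on linger.ms, so no small leftover batch is created.
static boolean shouldSwitchPartition(int bytesProducedToPartition, int batchSize, boolean batchIsLingering) {
    return bytesProducedToPartition >= batchSize && !batchIsLingering;
}
{code}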



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14087) Add jmh benchmark for producer with MockClient

2022-07-19 Thread Artem Livshits (Jira)
Artem Livshits created KAFKA-14087:
--

 Summary: Add jmh benchmark for producer with MockClient
 Key: KAFKA-14087
 URL: https://issues.apache.org/jira/browse/KAFKA-14087
 Project: Kafka
  Issue Type: Improvement
  Components: producer 
Reporter: Artem Livshits


Something like this
{code:java}
        Map configs = new HashMap<>();
        configs.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9000");
        configs.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "none");
        Time time = Time.SYSTEM;
        AtomicInteger offset = new AtomicInteger(0);
        MetadataResponse initialUpdateResponse = RequestTestUtils.metadataUpdateWith(1, singletonMap("topic", 2));
        ProducerMetadata metadata = newMetadata(0, Long.MAX_VALUE);
        StringBuilder value = new StringBuilder("foo");
        for (int i = 0; i < 1000; i++)
            value.append("x");
        AtomicInteger totalRecords = new AtomicInteger(0);
        long start = time.milliseconds();
        CompletableFuture[] futures = new CompletableFuture[3];
        for (int i = 0; i < futures.length; i++) {
            futures[i] = CompletableFuture.runAsync(() -> {
                ScheduledExecutorService executorService = Executors.newSingleThreadScheduledExecutor();
                MockClient client = new MockClient(time, metadata) {
                    @Override
                    public void send(ClientRequest request, long now) {
                        super.send(request, now);
                        if (request.apiKey() == ApiKeys.PRODUCE) {
                            // Prepare response data from request.
                            ProduceResponseData responseData = new ProduceResponseData();
                            ProduceRequest produceRequest = (ProduceRequest) request.requestBuilder().build();
                            produceRequest.data().topicData().forEach(topicData ->
                                    topicData.partitionData().forEach(partitionData -> {
                                        String topic = topicData.name();
                                        ProduceResponseData.TopicProduceResponse tpr = responseData.responses().find(topic);
                                        if (tpr == null) {
                                            tpr = new ProduceResponseData.TopicProduceResponse().setName(topic);
                                            responseData.responses().add(tpr);
                                        }
                                        tpr.partitionResponses().add(new ProduceResponseData.PartitionProduceResponse()
                                                .setIndex(partitionData.index())
                                                .setRecordErrors(Collections.emptyList())
                                                .setBaseOffset(offset.addAndGet(1))
                                                .setLogAppendTimeMs(time.milliseconds())
                                                .setLogStartOffset(0)
                                                .setErrorMessage("")
                                                .setErrorCode(Errors.NONE.code()));
                                    }));
                            // Schedule a reply to come after some time to mock broker latency.
                            executorService.schedule(() -> respond(new ProduceResponse(responseData)), 20, TimeUnit.MILLISECONDS);
                        }
                    }
                };
                client.updateMetadata(initialUpdateResponse);
                InitProducerIdResponseData responseData = new InitProducerIdResponseData()
                        .setErrorCode(Errors.NONE.code())
                        .setProducerEpoch((short) 0)
                        .setProducerId(42)
                        .setThrottleTimeMs(0);
                client.prepareResponse(body -> body instanceof InitProducerIdRequest,
                        new InitProducerIdResponse(responseData), false);
                try (KafkaProducer producer = kafkaProducer(
                        configs,
                        new StringSerializer(),
                        new StringSerializer(),
                        metadata,
                        client,
                        null,
                        time
                )) {
                    final int records = 20_000_000;
                    for (int k = 0; k < records; k++) {
                        producer.send(new ProducerRecord<>("topic", null, start, "key-" + k, value.toString()));
                    }
                    totalRecords.addAndGet(records);
                }
            });
        }
        for (CompletableFuture future : futures) {
            future.get();
        }
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14086) Cleanup PlaintextConsumerTest.testInterceptors to not pass null record

2022-07-19 Thread Artem Livshits (Jira)
Artem Livshits created KAFKA-14086:
--

 Summary: Cleanup PlaintextConsumerTest.testInterceptors to not 
pass null record
 Key: KAFKA-14086
 URL: https://issues.apache.org/jira/browse/KAFKA-14086
 Project: Kafka
  Issue Type: Task
Reporter: Artem Livshits


See https://github.com/apache/kafka/pull/12365/files#r919746298



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14085) Clean up usage of asserts in KafkaProducer

2022-07-19 Thread Artem Livshits (Jira)
Artem Livshits created KAFKA-14085:
--

 Summary: Clean up usage of asserts in KafkaProducer
 Key: KAFKA-14085
 URL: https://issues.apache.org/jira/browse/KAFKA-14085
 Project: Kafka
  Issue Type: Task
  Components: producer 
Reporter: Artem Livshits


See https://github.com/apache/kafka/pull/12365/files#r919749970



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] KIP-847

2022-07-18 Thread Artem Livshits
Thank you for the vote.  I've got three +1s (Ismael, Jun, David), closing
the vote now.

-Artem

On Wed, Jul 13, 2022 at 1:42 AM Ismael Juma  wrote:

> Thanks for the updates, +1 (binding) from me.
>
> Ismael
>
> On Fri, Jul 8, 2022 at 3:45 AM Artem Livshits
>  wrote:
>
> > Hello,
> >
> > There was an additional discussion and the KIP got changed as a result of
> > that.  I would like to restart the vote on the updated
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-847%3A+Add+ProducerIdCount+metrics
> > .
> >
> > -Artem
> >
> > On Fri, Jun 24, 2022 at 7:49 PM Luke Chen  wrote:
> >
> > > Hi Artem,
> > >
> > > Thanks for the KIP.
> > > +1 (binding) from me.
> > >
> > > In addition to the `ProducerIdCount` in motivation section, the KIP
> title
> > > should also be updated.
> > >
> > > Luke
> > >
> > > On Fri, Jun 24, 2022 at 8:33 PM David Jacot
>  > >
> > > wrote:
> > >
> > > > Thanks for the KIP, Artem.
> > > >
> > > > I am +1 (binding).
> > > >
> > > > A small nit: ProducerIdCount should be used in the motivation.
> > > >
> > > > Best,
> > > > David
> > > >
> > > > On Thu, Jun 23, 2022 at 10:26 PM Artem Livshits
> > > >  wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > I'd like to start a vote on KIP-847
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-847%3A+Add+ProducerCount+metrics
> > > > >
> > > > > -Artem
> > > >
> > >
> >
>


[jira] [Created] (KAFKA-14083) Check if we don't need to refresh time in RecordAccumulator.append

2022-07-18 Thread Artem Livshits (Jira)
Artem Livshits created KAFKA-14083:
--

 Summary: Check if we don't need to refresh time in 
RecordAccumulator.append
 Key: KAFKA-14083
 URL: https://issues.apache.org/jira/browse/KAFKA-14083
 Project: Kafka
  Issue Type: Task
  Components: producer 
Reporter: Artem Livshits


See https://github.com/apache/kafka/pull/12365/files#r912836877.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-847: Add ProducerCount metrics

2022-07-11 Thread Artem Livshits
Updated metric description as well.

-Artem

On Fri, Jul 8, 2022 at 9:22 AM Jun Rao  wrote:

> Hi, Artem,
>
> Thanks for the reply. It would be useful to add that clarification in the
> description of the metric. Other than that, the KIP looks good to me.
>
> Jun
>
> On Tue, Jul 5, 2022 at 5:57 PM Artem Livshits
>  wrote:
>
> > I've updated the KIP to clarify that the metric reflects the total amount
> > of producer ids in all partitions maintained in the broker.
> >
> > -Artem
> >
> > On Thu, Jun 30, 2022 at 11:46 AM Jun Rao 
> wrote:
> >
> > > Hi, Artem,
> > >
> > > Thanks for the reply.
> > >
> > > The memory usage on the broker is proportional to the number of
> > (partition,
> > > pid) combinations. So, I am wondering if we could have a metric that
> > > captures that. The proposed pid count metric doesn't fully capture that
> > > since each pid could be associated with a different number of
> partitions.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Jun 30, 2022 at 9:24 AM Justine Olshan
> > > 
> > > wrote:
> > >
> > > > Hi Artem,
> > > > Thanks for the update to include motivation. Makes sense to me.
> > > > Justine
> > > >
> > > > On Wed, Jun 29, 2022 at 6:51 PM Luke Chen  wrote:
> > > >
> > > > > Hi Artem,
> > > > >
> > > > > Thanks for the update.
> > > > > LGTM.
> > > > >
> > > > > Luke
> > > > >
> > > > > On Thu, Jun 30, 2022 at 6:51 AM Artem Livshits
> > > > >  wrote:
> > > > >
> > > > > > Thank you for your feedback. I've updated the KIP to elaborate on
> > the
> > > > > > motivation and provide some background on producer ids and how we
> > > > measure
> > > > > > them.
> > > > > >
> > > > > > Also, after some thinking and discussing it offline with some
> > folks,
> > > I
> > > > > > think that we don't really need partitioner level metrics.  We
> can
> > > use
> > > > > > existing tools to do granular debugging.  I've moved partition
> > level
> > > > > > metrics to the rejected alternatives section.
> > > > > >
> > > > > > -Artem
> > > > > >
> > > > > > On Wed, Jun 29, 2022 at 1:57 AM Luke Chen 
> > wrote:
> > > > > >
> > > > > > > Hi Artem,
> > > > > > >
> > > > > > > Could you elaborate more in the motivation section?
> > > > > > > I'm interested to know what kind of scenarios this metric can
> > > benefit
> > > > > > for.
> > > > > > > What could it bring to us when a topic partition has 100
> > > > > ProducerIdCount
> > > > > > VS
> > > > > > > another topic partition has 10 ProducerIdCount?
> > > > > > >
> > > > > > > Thank you.
> > > > > > > Luke
> > > > > > >
> > > > > > > On Wed, Jun 29, 2022 at 6:30 AM Jun Rao
>  > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi, Artem,
> > > > > > > >
> > > > > > > > Thanks for the KIP.
> > > > > > > >
> > > > > > > > Could you explain the partition level ProducerIdCount a bit
> > more?
> > > > > Does
> > > > > > > that
> > > > > > > > reflect the number of PIDs ever produced to a partition since
> > the
> > > > > > broker
> > > > > > > is
> > > > > > > > started? Do we reduce the count after a PID expires?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > > On Wed, Jun 22, 2022 at 1:08 AM David Jacot
> > > > > >  > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Artem,
> > > > > > > > >
> > > > > > > > > The KIP

Re: [VOTE] KIP-847

2022-07-07 Thread Artem Livshits
Hello,

There was an additional discussion and the KIP got changed as a result of
that.  I would like to restart the vote on the updated
https://cwiki.apache.org/confluence/display/KAFKA/KIP-847%3A+Add+ProducerIdCount+metrics
.

-Artem

On Fri, Jun 24, 2022 at 7:49 PM Luke Chen  wrote:

> Hi Artem,
>
> Thanks for the KIP.
> +1 (binding) from me.
>
> In addition to the `ProducerIdCount` in motivation section, the KIP title
> should also be updated.
>
> Luke
>
> On Fri, Jun 24, 2022 at 8:33 PM David Jacot 
> wrote:
>
> > Thanks for the KIP, Artem.
> >
> > I am +1 (binding).
> >
> > A small nit: ProducerIdCount should be used in the motivation.
> >
> > Best,
> > David
> >
> > On Thu, Jun 23, 2022 at 10:26 PM Artem Livshits
> >  wrote:
> > >
> > > Hello,
> > >
> > > I'd like to start a vote on KIP-847
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-847%3A+Add+ProducerCount+metrics
> > >
> > > -Artem
> >
>


Re: Transactions, delivery timeout and changing transactional producer behavior

2022-07-06 Thread Artem Livshits
Hi Daniel,

What you say makes sense.  Could you file a bug and put this info there so
that it's easier to track?

-Artem

On Wed, Jul 6, 2022 at 8:34 AM Dániel Urbán  wrote:

> Hello everyone,
>
> I've been investigating some transaction related issues in a very
> problematic cluster. Besides finding some interesting issues, I had some
> ideas about how transactional producer behavior could be improved.
>
> My suggestion in short is: when the transactional producer encounters an
> error which doesn't necessarily mean that the in-flight request was
> processed (for example a client side timeout), the producer should not send
> an EndTxnRequest on abort, but instead it should bump the producer epoch.
>
> The long description about the issue I found, and how I came to the
> suggestion:
>
> First, the description of the issue. When I say that the cluster is "very
> problematic", I mean all kinds of different issues, be it infra (disks and
> network) or throughput (high volume producers without fine tuning).
> In this cluster, Kafka transactions are widely used by many producers. And
> in this cluster, partitions get "stuck" frequently (few times every week).
>
> The exact meaning of a partition being "stuck" is this:
>
> On the client side:
> 1. A transactional producer sends X batches to a partition in a single
> transaction
> 2. Out of the X batches, the last few get sent, but are timed out thanks to
> the delivery timeout config
> 3. producer.flush() is unblocked due to all batches being "finished"
> 4. Based on the errors reported in the producer.send() callback,
> producer.abortTransaction() is called
> 5. Then producer.close() is also invoked with a 5s timeout (this
> application does not reuse the producer instances optimally)
> 6. The transactional.id of the producer is never reused (it was randomly
> generated)
>
> On the partition leader side (what appears in the log segment of the
> partition):
> 1. The batches sent by the producer are all appended to the log
> 2. But the ABORT marker of the transaction was appended before the last 1
> or 2 batches of the transaction
>
> On the transaction coordinator side (what appears in the transaction state
> partition):
> The transactional.id is present with the Empty state.
>
> These happenings result in the following:
> 1. The partition leader handles the first batch after the ABORT marker as
> the first message of a new transaction of the same producer id + epoch.
> (LSO is blocked at this point)
> 2. The transaction coordinator is not aware of any in-progress transaction
> of the producer, thus never aborting the transaction, not even after the
> transaction.timeout.ms passes.
>
> This is happening with Kafka 2.5 running in the cluster, producer versions
> range between 2.0 and 2.6.
> I scanned through a lot of tickets, and I believe that this issue is not
> specific to these versions, and could happen with newest versions as well.
> If I'm mistaken, some pointers would be appreciated.
>
> Assuming that the issue could occur with any version, I believe this issue
> boils down to one oversight on the client side:
> When a request fails without a definitive response (e.g. a delivery
> timeout), the client cannot assume that the request is "finished", and
> simply abort the transaction. If the request is still in flight, and the
> EndTxnRequest and then the WriteTxnMarkerRequest get sent and processed
> earlier, the contract is violated by the client.
> This could be avoided by providing more information to the partition
> leader. Right now, a new transactional batch signals the start of a new
> transaction, and there is no way for the partition leader to decide whether
> the batch is an out-of-order message.
> In a naive and wasteful protocol, we could have a unique transaction id
> added to each batch and marker, meaning that the leader would be capable of
> refusing batches which arrive after the control marker of the transaction.
> But instead of changing the log format and the protocol, we can achieve the
> same by bumping the producer epoch.
>
> Bumping the epoch has a similar effect to "changing the transaction id" -
> the in-progress transaction will be aborted with a bumped producer epoch,
> telling the partition leader about the producer epoch change. From this
> point on, any batches sent with the old epoch will be refused by the leader
> due to the fencing mechanism. It doesn't really matter how many batches
> will get appended to the log, and how many will be refused - this is an
> aborted transaction - but the out-of-order message cannot occur, and cannot
> block the LSO infinitely.
>
> My suggestion is, that the TransactionManager inside the producer should
> keep track of what type of errors were encountered by the batches of the
> transaction, and categorize them along the lines of "definitely completed"
> and "might not be completed". When the transaction goes into an abortable
> state, and there is at least one batch with "might not be 

Re: [DISCUSS] KIP-847: Add ProducerCount metrics

2022-07-05 Thread Artem Livshits
I've updated the KIP to clarify that the metric reflects the total amount
of producer ids in all partitions maintained in the broker.
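
In other words (a sketch with illustrative types, not the broker code), the
count is per partition/producer-id entry:

    // Sketch: a producer id tracked by N partitions contributes N to
    // ProducerIdCount, which is what tracks broker memory usage.
    static int producerIdCount(java.util.Map<String, java.util.Set<Long>> pidsByPartition) {
        return pidsByPartition.values().stream().mapToInt(java.util.Set::size).sum();
    }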

-Artem

On Thu, Jun 30, 2022 at 11:46 AM Jun Rao  wrote:

> Hi, Artem,
>
> Thanks for the reply.
>
> The memory usage on the broker is proportional to the number of (partition,
> pid) combinations. So, I am wondering if we could have a metric that
> captures that. The proposed pid count metric doesn't fully capture that
> since each pid could be associated with a different number of partitions.
>
> Thanks,
>
> Jun
>
> On Thu, Jun 30, 2022 at 9:24 AM Justine Olshan
> 
> wrote:
>
> > Hi Artem,
> > Thanks for the update to include motivation. Makes sense to me.
> > Justine
> >
> > On Wed, Jun 29, 2022 at 6:51 PM Luke Chen  wrote:
> >
> > > Hi Artem,
> > >
> > > Thanks for the update.
> > > LGTM.
> > >
> > > Luke
> > >
> > > On Thu, Jun 30, 2022 at 6:51 AM Artem Livshits
> > >  wrote:
> > >
> > > > Thank you for your feedback. I've updated the KIP to elaborate on the
> > > > motivation and provide some background on producer ids and how we
> > measure
> > > > them.
> > > >
> > > > Also, after some thinking and discussing it offline with some folks,
> I
> > > > think that we don't really need partitioner level metrics.  We can
> use
> > > > existing tools to do granular debugging.  I've moved partition level
> > > > metrics to the rejected alternatives section.
> > > >
> > > > -Artem
> > > >
> > > > On Wed, Jun 29, 2022 at 1:57 AM Luke Chen  wrote:
> > > >
> > > > > Hi Artem,
> > > > >
> > > > > Could you elaborate more in the motivation section?
> > > > > I'm interested to know what kind of scenarios this metric can
> benefit
> > > > for.
> > > > > What could it bring to us when a topic partition has 100
> > > ProducerIdCount
> > > > VS
> > > > > another topic partition has 10 ProducerIdCount?
> > > > >
> > > > > Thank you.
> > > > > Luke
> > > > >
> > > > > On Wed, Jun 29, 2022 at 6:30 AM Jun Rao 
> > > > wrote:
> > > > >
> > > > > > Hi, Artem,
> > > > > >
> > > > > > Thanks for the KIP.
> > > > > >
> > > > > > Could you explain the partition level ProducerIdCount a bit more?
> > > Does
> > > > > that
> > > > > > reflect the number of PIDs ever produced to a partition since the
> > > > broker
> > > > > is
> > > > > > started? Do we reduce the count after a PID expires?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Wed, Jun 22, 2022 at 1:08 AM David Jacot
> > > >  > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Artem,
> > > > > > >
> > > > > > > The KIP LGTM.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > David
> > > > > > >
> > > > > > > On Tue, Jun 21, 2022 at 9:32 PM Artem Livshits
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > If there is no other feedback I'm going to start voting in a
> > > couple
> > > > > > days.
> > > > > > > >
> > > > > > > > -Artem
> > > > > > > >
> > > > > > > > On Fri, Jun 17, 2022 at 3:50 PM Artem Livshits <
> > > > > alivsh...@confluent.io
> > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thank you for your feedback.  Updated the KIP and added the
> > > > > Rejected
> > > > > > > > > Alternatives section.
> > > > > > > > >
> > > > > > > > > -Artem
> > > > > > > > >
> > > > > > > > > On Fri, Jun 17, 2022 at 1:16 PM Ismael Juma <
> > ism...@juma.me.uk
> > > >
> > > > > > wrote:
> > > > > > > 

Re: [DISCUSS] KIP-847: Add ProducerCount metrics

2022-06-29 Thread Artem Livshits
Thank you for your feedback. I've updated the KIP to elaborate on the
motivation and provide some background on producer ids and how we measure
them.

Also, after some thinking and discussing it offline with some folks, I
think that we don't really need partitioner level metrics.  We can use
existing tools to do granular debugging.  I've moved partition level
metrics to the rejected alternatives section.

-Artem

On Wed, Jun 29, 2022 at 1:57 AM Luke Chen  wrote:

> Hi Artem,
>
> Could you elaborate more in the motivation section?
> I'm interested to know what kind of scenarios this metric can benefit for.
> What could it bring to us when a topic partition has 100 ProducerIdCount VS
> another topic partition has 10 ProducerIdCount?
>
> Thank you.
> Luke
>
> On Wed, Jun 29, 2022 at 6:30 AM Jun Rao  wrote:
>
> > Hi, Artem,
> >
> > Thanks for the KIP.
> >
> > Could you explain the partition level ProducerIdCount a bit more? Does
> that
> > reflect the number of PIDs ever produced to a partition since the broker
> is
> > started? Do we reduce the count after a PID expires?
> >
> > Thanks,
> >
> > Jun
> >
> > On Wed, Jun 22, 2022 at 1:08 AM David Jacot  >
> > wrote:
> >
> > > Hi Artem,
> > >
> > > The KIP LGTM.
> > >
> > > Thanks,
> > > David
> > >
> > > On Tue, Jun 21, 2022 at 9:32 PM Artem Livshits
> > >  wrote:
> > > >
> > > > If there is no other feedback I'm going to start voting in a couple
> > days.
> > > >
> > > > -Artem
> > > >
> > > > On Fri, Jun 17, 2022 at 3:50 PM Artem Livshits <
> alivsh...@confluent.io
> > >
> > > > wrote:
> > > >
> > > > > Thank you for your feedback.  Updated the KIP and added the
> Rejected
> > > > > Alternatives section.
> > > > >
> > > > > -Artem
> > > > >
> > > > > On Fri, Jun 17, 2022 at 1:16 PM Ismael Juma 
> > wrote:
> > > > >
> > > > >> If we don't track them separately, then it makes sense to keep it
> as
> > > one
> > > > >> metric. I'd probably name it ProducerIdCount in that case.
> > > > >>
> > > > >> Ismael
> > > > >>
> > > > >> On Fri, Jun 17, 2022 at 12:04 PM Artem Livshits
> > > > >>  wrote:
> > > > >>
> > > > >> > Do you propose to have 2 metrics instead of one?  Right now we
> > don't
> > > > >> track
> > > > >> > if the producer id was transactional or idempotent and for
> metric
> > > > >> > collection we'd either have to pay the cost of iterating over
> > > producer
> > > > >> ids
> > > > >> > (which could be a lot) or split the producer map into 2 or cache
> > the
> > > > >> > counts, which complicates the code.
> > > > >> >
> > > > >> > From the monitoring perspective, I think one metric should be
> > good,
> > > but
> > > > >> > maybe I'm missing some scenarios.
> > > > >> >
> > > > >> > -Artem
> > > > >> >
> > > > >> > On Fri, Jun 17, 2022 at 12:28 AM Ismael Juma  >
> > > wrote:
> > > > >> >
> > > > >> > > I like the suggestion to have IdempotentProducerCount and
> > > > >> > > TransactionalProducerCount metrics.
> > > > >> > >
> > > > >> > > Ismael
> > > > >> > >
> > > > >> > > On Thu, Jun 16, 2022 at 2:27 PM Artem Livshits
> > > > >> > >  wrote:
> > > > >> > >
> > > > >> > > > Hi Ismael,
> > > > >> > > >
> > > > >> > > > Thank you for your feedback.  Yes, this is counting the
> number
> > > of
> > > > >> > > producer
> > > > >> > > > ids tracked by the partition and broker.  Another options I
> > was
> > > > >> > thinking
> > > > >> > > of
> > > > >> > > > are the following:
> > > > >> > > >
> > > > >> > > > - IdempotentProducerCount
> > > > >> > > > - TransactionalProducerCount
> > > > >> > > > - ProducerIdCount
> > > > >> > > >
> > > > >> > > > Let me know if one of these seems better, or I'm open to
> other
> > > name
> > > > >> > > > suggestions as well.
> > > > >> > > >
> > > > >> > > > -Artem
> > > > >> > > >
> > > > >> > > > On Wed, Jun 15, 2022 at 11:49 PM Ismael Juma <
> > ism...@juma.me.uk
> > > >
> > > > >> > wrote:
> > > > >> > > >
> > > > >> > > > > Thanks for the KIP.
> > > > >> > > > >
> > > > >> > > > > ProducerCount seems like a misleading name since producers
> > > > >> without a
> > > > >> > > > > producer id are not counted. Is this meant to count the
> > > number of
> > > > >> > > > producer
> > > > >> > > > > IDs tracked by the broker?
> > > > >> > > > >
> > > > >> > > > > Ismael
> > > > >> > > > >
> > > > >> > > > > On Wed, Jun 15, 2022, 3:12 PM Artem Livshits <
> > > > >> alivsh...@confluent.io
> > > > >> > > > > .invalid>
> > > > >> > > > > wrote:
> > > > >> > > > >
> > > > >> > > > > > Hello,
> > > > >> > > > > >
> > > > >> > > > > > I'd like to start a discussion on the KIP-847:
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-847%3A+Add+ProducerCount+metrics
> > > > >> > > > > > .
> > > > >> > > > > >
> > > > >> > > > > > -Artem
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > >
> >
>


[VOTE] KIP-847

2022-06-23 Thread Artem Livshits
Hello,

I'd like to start a vote on KIP-847
https://cwiki.apache.org/confluence/display/KAFKA/KIP-847%3A+Add+ProducerCount+metrics

-Artem


Re: [DISCUSS] KIP-847: Add ProducerCount metrics

2022-06-21 Thread Artem Livshits
If there is no other feedback I'm going to start voting in a couple days.

-Artem

On Fri, Jun 17, 2022 at 3:50 PM Artem Livshits 
wrote:

> Thank you for your feedback.  Updated the KIP and added the Rejected
> Alternatives section.
>
> -Artem
>
> On Fri, Jun 17, 2022 at 1:16 PM Ismael Juma  wrote:
>
>> If we don't track them separately, then it makes sense to keep it as one
>> metric. I'd probably name it ProducerIdCount in that case.
>>
>> Ismael
>>
>> On Fri, Jun 17, 2022 at 12:04 PM Artem Livshits
>>  wrote:
>>
>> > Do you propose to have 2 metrics instead of one?  Right now we don't
>> track
>> > if the producer id was transactional or idempotent and for metric
>> > collection we'd either have to pay the cost of iterating over producer
>> ids
>> > (which could be a lot) or split the producer map into 2 or cache the
>> > counts, which complicates the code.
>> >
>> > From the monitoring perspective, I think one metric should be good, but
>> > maybe I'm missing some scenarios.
>> >
>> > -Artem
>> >
>> > On Fri, Jun 17, 2022 at 12:28 AM Ismael Juma  wrote:
>> >
>> > > I like the suggestion to have IdempotentProducerCount and
>> > > TransactionalProducerCount metrics.
>> > >
>> > > Ismael
>> > >
>> > > On Thu, Jun 16, 2022 at 2:27 PM Artem Livshits
>> > >  wrote:
>> > >
>> > > > Hi Ismael,
>> > > >
>> > > > Thank you for your feedback.  Yes, this is counting the number of
>> > > producer
>> > > > ids tracked by the partition and broker.  Another options I was
>> > thinking
>> > > of
>> > > > are the following:
>> > > >
>> > > > - IdempotentProducerCount
>> > > > - TransactionalProducerCount
>> > > > - ProducerIdCount
>> > > >
>> > > > Let me know if one of these seems better, or I'm open to other name
>> > > > suggestions as well.
>> > > >
>> > > > -Artem
>> > > >
>> > > > On Wed, Jun 15, 2022 at 11:49 PM Ismael Juma 
>> > wrote:
>> > > >
>> > > > > Thanks for the KIP.
>> > > > >
>> > > > > ProducerCount seems like a misleading name since producers
>> without a
>> > > > > producer id are not counted. Is this meant to count the number of
>> > > > producer
>> > > > > IDs tracked by the broker?
>> > > > >
>> > > > > Ismael
>> > > > >
>> > > > > On Wed, Jun 15, 2022, 3:12 PM Artem Livshits <
>> alivsh...@confluent.io
>> > > > > .invalid>
>> > > > > wrote:
>> > > > >
>> > > > > > Hello,
>> > > > > >
>> > > > > > I'd like to start a discussion on the KIP-847:
>> > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-847%3A+Add+ProducerCount+metrics
>> > > > > > .
>> > > > > >
>> > > > > > -Artem
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>


Re: [DISCUSS] KIP-847: Add ProducerCount metrics

2022-06-17 Thread Artem Livshits
Thank you for your feedback.  Updated the KIP and added the Rejected
Alternatives section.

-Artem

On Fri, Jun 17, 2022 at 1:16 PM Ismael Juma  wrote:

> If we don't track them separately, then it makes sense to keep it as one
> metric. I'd probably name it ProducerIdCount in that case.
>
> Ismael
>
> On Fri, Jun 17, 2022 at 12:04 PM Artem Livshits
>  wrote:
>
> > Do you propose to have 2 metrics instead of one?  Right now we don't
> track
> > if the producer id was transactional or idempotent and for metric
> > collection we'd either have to pay the cost of iterating over producer
> ids
> > (which could be a lot) or split the producer map into 2 or cache the
> > counts, which complicates the code.
> >
> > From the monitoring perspective, I think one metric should be good, but
> > maybe I'm missing some scenarios.
> >
> > -Artem
> >
> > On Fri, Jun 17, 2022 at 12:28 AM Ismael Juma  wrote:
> >
> > > I like the suggestion to have IdempotentProducerCount and
> > > TransactionalProducerCount metrics.
> > >
> > > Ismael
> > >
> > > On Thu, Jun 16, 2022 at 2:27 PM Artem Livshits
> > >  wrote:
> > >
> > > > Hi Ismael,
> > > >
> > > > Thank you for your feedback.  Yes, this is counting the number of
> > > producer
> > > > ids tracked by the partition and broker.  Another options I was
> > thinking
> > > of
> > > > are the following:
> > > >
> > > > - IdempotentProducerCount
> > > > - TransactionalProducerCount
> > > > - ProducerIdCount
> > > >
> > > > Let me know if one of these seems better, or I'm open to other name
> > > > suggestions as well.
> > > >
> > > > -Artem
> > > >
> > > > On Wed, Jun 15, 2022 at 11:49 PM Ismael Juma 
> > wrote:
> > > >
> > > > > Thanks for the KIP.
> > > > >
> > > > > ProducerCount seems like a misleading name since producers without
> a
> > > > > producer id are not counted. Is this meant to count the number of
> > > > producer
> > > > > IDs tracked by the broker?
> > > > >
> > > > > Ismael
> > > > >
> > > > > On Wed, Jun 15, 2022, 3:12 PM Artem Livshits <
> alivsh...@confluent.io
> > > > > .invalid>
> > > > > wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I'd like to start a discussion on the KIP-847:
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-847%3A+Add+ProducerCount+metrics
> > > > > > .
> > > > > >
> > > > > > -Artem
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] KIP-847: Add ProducerCount metrics

2022-06-17 Thread Artem Livshits
Do you propose to have 2 metrics instead of one?  Right now we don't track
if the producer id was transactional or idempotent and for metric
collection we'd either have to pay the cost of iterating over producer ids
(which could be a lot) or split the producer map into 2 or cache the
counts, which complicates the code.

From the monitoring perspective, I think one metric should be good, but
maybe I'm missing some scenarios.
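
For context, here is a minimal sketch (illustrative only; ProducerStateTracker is a
hypothetical class, not the broker's actual code) of why keeping a single map and
reporting its size is the cheap option, compared to iterating entries or splitting
the map into transactional and idempotent halves:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ProducerStateTracker {
    // Stand-in for the broker's per-producer-id state (hypothetical type).
    static final class ProducerState {}

    private final Map<Long, ProducerState> producers = new ConcurrentHashMap<>();

    void track(long producerId, ProducerState state) {
        producers.put(producerId, state);
    }

    void expire(long producerId) {
        producers.remove(producerId);
    }

    // A ProducerIdCount gauge can simply report the map size; no per-record
    // bookkeeping or iteration over the entries is required.
    int producerIdCount() {
        return producers.size();
    }
}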

-Artem

On Fri, Jun 17, 2022 at 12:28 AM Ismael Juma  wrote:

> I like the suggestion to have IdempotentProducerCount and
> TransactionalProducerCount metrics.
>
> Ismael
>
> On Thu, Jun 16, 2022 at 2:27 PM Artem Livshits
>  wrote:
>
> > Hi Ismael,
> >
> > Thank you for your feedback.  Yes, this is counting the number of
> producer
> > ids tracked by the partition and broker.  Another options I was thinking
> of
> > are the following:
> >
> > - IdempotentProducerCount
> > - TransactionalProducerCount
> > - ProducerIdCount
> >
> > Let me know if one of these seems better, or I'm open to other name
> > suggestions as well.
> >
> > -Artem
> >
> > On Wed, Jun 15, 2022 at 11:49 PM Ismael Juma  wrote:
> >
> > > Thanks for the KIP.
> > >
> > > ProducerCount seems like a misleading name since producers without a
> > > producer id are not counted. Is this meant to count the number of
> > producer
> > > IDs tracked by the broker?
> > >
> > > Ismael
> > >
> > > On Wed, Jun 15, 2022, 3:12 PM Artem Livshits  > > .invalid>
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > I'd like to start a discussion on the KIP-847:
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-847%3A+Add+ProducerCount+metrics
> > > > .
> > > >
> > > > -Artem
> > > >
> > >
> >
>


Re: [DISCUSS] KIP-847: Add ProducerCount metrics

2022-06-16 Thread Artem Livshits
Hi Ismael,

Thank you for your feedback.  Yes, this is counting the number of producer
ids tracked by the partition and broker.  Other options I was thinking of
are the following:

- IdempotentProducerCount
- TransactionalProducerCount
- ProducerIdCount

Let me know if one of these seems better, or I'm open to other name
suggestions as well.

-Artem

On Wed, Jun 15, 2022 at 11:49 PM Ismael Juma  wrote:

> Thanks for the KIP.
>
> ProducerCount seems like a misleading name since producers without a
> producer id are not counted. Is this meant to count the number of producer
> IDs tracked by the broker?
>
> Ismael
>
> On Wed, Jun 15, 2022, 3:12 PM Artem Livshits  .invalid>
> wrote:
>
> > Hello,
> >
> > I'd like to start a discussion on the KIP-847:
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-847%3A+Add+ProducerCount+metrics
> > .
> >
> > -Artem
> >
>


[DISCUSS] KIP-847: Add ProducerCount metrics

2022-06-15 Thread Artem Livshits
Hello,

I'd like to start a discussion on the KIP-847:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-847%3A+Add+ProducerCount+metrics
.

-Artem


[jira] [Created] (KAFKA-13999) Add ProducerCount metrics (KIP-847)

2022-06-15 Thread Artem Livshits (Jira)
Artem Livshits created KAFKA-13999:
--

 Summary: Add ProducerCount metrics (KIP-847)
 Key: KAFKA-13999
 URL: https://issues.apache.org/jira/browse/KAFKA-13999
 Project: Kafka
  Issue Type: Improvement
Reporter: Artem Livshits


See 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-847%3A+Add+ProducerCount+metrics



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (KAFKA-13992) MockProducer shouldn't use DefaultPartitioner

2022-06-14 Thread Artem Livshits (Jira)
Artem Livshits created KAFKA-13992:
--

 Summary: MockProducer shouldn't use DefaultPartitioner
 Key: KAFKA-13992
 URL: https://issues.apache.org/jira/browse/KAFKA-13992
 Project: Kafka
  Issue Type: Task
  Components: producer 
Reporter: Artem Livshits


DefaultPartitioner got deprecated as part of 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner#KIP794:StrictlyUniformStickyPartitioner-TestResults.]
  Also, MockProducer doesn't seem to call .onNewBatch, which means it doesn't 
really trigger a partition switch.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


Re: [DISCUSS] KIP-841: Fenced replicas should not be allowed to join the ISR in KRaft

2022-05-20 Thread Artem Livshits
The KIP LGTM.  My only question is why it's an issue with KRaft -- looks
like ZK would have the same issue?

-Artem

On Fri, May 20, 2022 at 8:51 AM David Jacot 
wrote:

> This KIP is pretty straight forward. I will start a vote on Monday
> if no one objects.
>
> Best,
> David
>
> On Wed, May 18, 2022 at 5:55 PM David Jacot  wrote:
> >
> > Hi,
> >
> > I created a small KIP to strengthen the AlterPartition API in KRaft mode:
> > https://cwiki.apache.org/confluence/x/phmhD
> >
> > Let me know what you think.
> >
> > Best,
> > David
>


[jira] [Created] (KAFKA-13885) Add new metrics for partitioner logic introduced in KIP-794

2022-05-06 Thread Artem Livshits (Jira)
Artem Livshits created KAFKA-13885:
--

 Summary: Add new metrics for partitioner logic introduced in 
KIP-794
 Key: KAFKA-13885
 URL: https://issues.apache.org/jira/browse/KAFKA-13885
 Project: Kafka
  Issue Type: Improvement
Reporter: Artem Livshits


[https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner]
 introduced new partitioning logic; it would be good to get some observability 
into the logic.  For example, one metric could be the number of brokers that we 
marked unavailable because their latency exceeded 
*partitioner.availability.timeout.ms.*
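
As a rough illustration only (the metric name and the wiring are assumptions, not
part of the ticket), such a gauge could be registered through the client metrics
API roughly like this:

import java.util.concurrent.atomic.AtomicInteger;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.Measurable;
import org.apache.kafka.common.metrics.Metrics;

class PartitionerMetricsSketch {
    private final AtomicInteger unavailablePartitions = new AtomicInteger();

    void register(Metrics metrics) {
        MetricName name = metrics.metricName(
            "unavailable-partitions",   // hypothetical metric name
            "producer-metrics",
            "Partitions currently skipped because broker latency exceeded partitioner.availability.timeout.ms");
        metrics.addMetric(name, (Measurable) (config, now) -> unavailablePartitions.get());
    }

    void markUnavailable() { unavailablePartitions.incrementAndGet(); }

    void markAvailableAgain() { unavailablePartitions.decrementAndGet(); }
}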



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


Re: [VOTE] KIP-794: Strictly Uniform Sticky Partitioner

2022-05-05 Thread Artem Livshits
Hi  Guozhang,

The current DefaultStreamPartitioner behavior is not changed by this KIP,
so I think we can track it separately.  I've created a ticket:
https://issues.apache.org/jira/browse/KAFKA-13880.

-Artem



On Thu, May 5, 2022 at 10:24 AM Guozhang Wang  wrote:

> Hello Artem,
>
> Thanks for proposing this KIP. I took a look at the current PR and also
> thought about its implications on Kafka Streams. Here are some thoughts:
>
> Today Kafka Streams use an explicit Partitioner --- note it is not
> implementing the Producer's Partitioner --- to determine the partition
> number before calling `producer.send`. Also the record is serialized
> outside the `send` call as well. That means, the record sent to the
> producer is always in `` type and partition id is always
> specified.
>
> The reason for serializing outside the producer is that the same producer
> is used to send to various topics with different schema, and hence we
> cannot specify a single serializer config. And the reason to determine the
> partition id outside the producer is that we have different logic for
> windowed record v.s. non-windowed record and hence cannot have a single
> partitioner config.
>
> For the windowed partitioner it does not matter since the key would always
> be specified and hence we would not leverage sticky partitioning. For the
> non-windowed partitioner it's possible that the key is null, and inside the
> non-windowed customized partitioner, the `DefaultPartitioner` is used
> indeed. But with this KIP as the new logic is not encoded in the configured
> partitioner it means Kafka Streams would not be able to leverage its
> benefits.
>
> I think we can modify the non-windowed partitioner such that when the key
> is null, we just set the partition to null, then inside the KafkaProducer
> we could still leverage on the new sticky behavior. Since in Kafka Streams
> only sink topics data may have null keys which would not be required state
> metadata, relaxing this in the StreamsPartitioner should be fine.
>
>
> Guozhang
>
>
> On Sat, Mar 26, 2022 at 4:05 PM Lucas Bradstreet
> 
> wrote:
>
> > Hi Artem,
> >
> > Thank you for all the work on this. I think it'll be quite impactful.
> >
> > +1 non-binding from me.
> >
> > Lucas
> >
> > On Wed, Mar 23, 2022 at 8:27 PM Luke Chen  wrote:
> >
> > > Hi Artem,
> > >
> > > Thanks for the KIP and the patience during discussion!
> > > +1 binding from me.
> > >
> > > Luke
> > >
> > > On Thu, Mar 24, 2022 at 3:43 AM Ismael Juma  wrote:
> > >
> > > > Thanks for the KIP and for taking the time to address all the
> feedback.
> > > +1
> > > > (binding)
> > > >
> > > > Ismael
> > > >
> > > > On Mon, Mar 21, 2022 at 4:52 PM Artem Livshits
> > > >  wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I'd like to start a vote on
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > .
> > > > >
> > > > > -Artem
> > > > >
> > > >
> > >
> >
>
>
> --
> -- Guozhang
>


[jira] [Created] (KAFKA-13880) DefaultStreamPartitioner may get "stuck" to one partition for unkeyed messages

2022-05-05 Thread Artem Livshits (Jira)
Artem Livshits created KAFKA-13880:
--

 Summary: DefaultStreamPartitioner may get "stuck" to one partition 
for unkeyed messages
 Key: KAFKA-13880
 URL: https://issues.apache.org/jira/browse/KAFKA-13880
 Project: Kafka
  Issue Type: Bug
  Components: streams
Affects Versions: 2.4.0
Reporter: Artem Livshits


While working on KIP-794, I noticed that DefaultStreamPartitioner does not call 
.onNewBatch.  The "sticky" DefaultStreamPartitioner introduced as a result of 
https://issues.apache.org/jira/browse/KAFKA-8601 requires .onNewBatch call in 
order to switch to a new partition for unkeyed messages; just calling 
.partition would return the same "sticky" partition chosen during the first 
call to .partition.  The partition doesn't change even if the partition leader 
is unavailable.

Ideally, for unkeyed messages the DefaultStreamPartitioner should take 
advantage of the new built-in partitioning logic introduced in 
[https://github.com/apache/kafka/pull/12049.]  Perhaps it could return a null 
partition for unkeyed messages, so that KafkaProducer could run its built-in 
partitioning logic.
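
A minimal sketch of that idea (illustrative only; the class name and wiring are
assumptions, not a proposed API change):

import org.apache.kafka.common.serialization.Serializer;
import org.apache.kafka.common.utils.Utils;
import org.apache.kafka.streams.processor.StreamPartitioner;

class NullForUnkeyedPartitioner<K, V> implements StreamPartitioner<K, V> {
    private final Serializer<K> keySerializer;

    NullForUnkeyedPartitioner(Serializer<K> keySerializer) {
        this.keySerializer = keySerializer;
    }

    @Override
    public Integer partition(String topic, K key, V value, int numPartitions) {
        if (key == null) {
            // Returning null lets KafkaProducer apply its built-in partitioning
            // logic (KIP-794) instead of pinning to one "sticky" partition.
            return null;
        }
        byte[] keyBytes = keySerializer.serialize(topic, key);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }
}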



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[VOTE] KIP-794: Strictly Uniform Sticky Partitioner

2022-03-21 Thread Artem Livshits
Hi all,

I'd like to start a vote on
https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
.

-Artem


Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

2022-03-17 Thread Artem Livshits
Hi all,

Thank you to everybody who contributed to this discussion; we've come a
long way and now the "rejected alternatives" section is almost as big as
the actual proposal (which indicates that it's a well thought through
solution :-)).  If there are no further considerations, I'll start voting
in the next couple of days.

-Artem

On Mon, Mar 14, 2022 at 6:19 PM Artem Livshits 
wrote:

> Hi Jun,
>
> 33. Sounds good.  Updated the KIP.
>
> -Artem
>
> On Mon, Mar 14, 2022 at 5:45 PM Jun Rao  wrote:
>
>> Hi, Artem,
>>
>> 33. We introduced onNewBatch() primarily for the sticky partitioner. It
>> seems to be a very subtle API to explain and to use properly. If we can't
>> find any convincing usage, it's probably better to deprecate it so that we
>> could keep the API clean.
>>
>> Thanks,
>>
>> Jun
>>
>> On Mon, Mar 14, 2022 at 1:36 PM Artem Livshits
>>  wrote:
>>
>> > Hi Jun,
>> >
>> > 33.  That's an interesting point.  Technically, onNewBatch is just a
>> way to
>> > pass some signal to the partitioner, the sticky partitioner uses this
>> > signal that is suboptimal, in theory someone could use it for something
>> > else
>> >
>> > -Artem
>> >
>> > On Mon, Mar 14, 2022 at 9:11 AM Jun Rao 
>> wrote:
>> >
>> > > Hi, Artem,
>> > >
>> > > Thanks for the reply. A couple of more comments.
>> > >
>> > > 32. We could defer the metrics until we know what to add.
>> > >
>> > > 33. Since we are deprecating DefaultPartitioner and
>> > > UniformStickyPartitioner, should we depreciate
>> Partitioner.onNewBatch()
>> > too
>> > > given its unexpected side effect?
>> > >
>> > > Thanks,
>> > >
>> > > Jun
>> > >
>> > > On Thu, Mar 10, 2022 at 5:20 PM Artem Livshits
>> > >  wrote:
>> > >
>> > > > Hi Jun,
>> > > >
>> > > > 32. Good point.  Do you think it's ok to defer the metrics until we
>> run
>> > > > some benchmarks so that we get a better idea of what metrics we
>> need?
>> > > >
>> > > > -Artem
>> > > >
>> > > > On Thu, Mar 10, 2022 at 3:12 PM Jun Rao 
>> > > wrote:
>> > > >
>> > > > > Hi, Artem.
>> > > > >
>> > > > > Thanks for the reply. One more comment.
>> > > > >
>> > > > > 32. Do we need to add any new metric on the producer? For
>> example, if
>> > > > > partitioner.availability.timeout.ms is > 0, it might be useful to
>> > know
>> > > > the
>> > > > > number of unavailable partitions.
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Jun
>> > > > >
>> > > > > On Thu, Mar 10, 2022 at 12:46 PM Artem Livshits
>> > > > >  wrote:
>> > > > >
>> > > > > > Hi Jun,
>> > > > > >
>> > > > > > 30.  Clarified.
>> > > > > >
>> > > > > > 31. I plan to do some benchmarking once implementation is
>> finished,
>> > > > I'll
>> > > > > > update the KIP with the results once I have them.  The reason to
>> > make
>> > > > it
>> > > > > > default is that it won't be used otherwise and we won't know if
>> > it's
>> > > > good
>> > > > > > or not in practical workloads.
>> > > > > >
>> > > > > > -Artem
>> > > > > >
>> > > > > > On Thu, Mar 10, 2022 at 11:42 AM Jun Rao
>> > > >
>> > > > > wrote:
>> > > > > >
>> > > > > > > Hi, Artem,
>> > > > > > >
>> > > > > > > Thanks for the updated KIP. A couple of more comments.
>> > > > > > >
>> > > > > > > 30. For the 3 new configs, it would be useful to make it clear
>> > that
>> > > > > they
>> > > > > > > are only relevant when the partitioner class is null.
>> > > > > > >
>> > > > > > > 31. partitioner.adaptive.partitioning.enable : I am wondering
>> > > whether
>&

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

2022-03-14 Thread Artem Livshits
Hi Jun,

33. Sounds good.  Updated the KIP.

-Artem

On Mon, Mar 14, 2022 at 5:45 PM Jun Rao  wrote:

> Hi, Artem,
>
> 33. We introduced onNewBatch() primarily for the sticky partitioner. It
> seems to be a very subtle API to explain and to use properly. If we can't
> find any convincing usage, it's probably better to deprecate it so that we
> could keep the API clean.
>
> Thanks,
>
> Jun
>
> On Mon, Mar 14, 2022 at 1:36 PM Artem Livshits
>  wrote:
>
> > Hi Jun,
> >
> > 33.  That's an interesting point.  Technically, onNewBatch is just a way
> to
> > pass some signal to the partitioner, the sticky partitioner uses this
> > signal that is suboptimal, in theory someone could use it for something
> > else
> >
> > -Artem
> >
> > On Mon, Mar 14, 2022 at 9:11 AM Jun Rao 
> wrote:
> >
> > > Hi, Artem,
> > >
> > > Thanks for the reply. A couple of more comments.
> > >
> > > 32. We could defer the metrics until we know what to add.
> > >
> > > 33. Since we are deprecating DefaultPartitioner and
> > > UniformStickyPartitioner, should we depreciate Partitioner.onNewBatch()
> > too
> > > given its unexpected side effect?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Mar 10, 2022 at 5:20 PM Artem Livshits
> > >  wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > 32. Good point.  Do you think it's ok to defer the metrics until we
> run
> > > > some benchmarks so that we get a better idea of what metrics we need?
> > > >
> > > > -Artem
> > > >
> > > > On Thu, Mar 10, 2022 at 3:12 PM Jun Rao 
> > > wrote:
> > > >
> > > > > Hi, Artem.
> > > > >
> > > > > Thanks for the reply. One more comment.
> > > > >
> > > > > 32. Do we need to add any new metric on the producer? For example,
> if
> > > > > partitioner.availability.timeout.ms is > 0, it might be useful to
> > know
> > > > the
> > > > > number of unavailable partitions.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Thu, Mar 10, 2022 at 12:46 PM Artem Livshits
> > > > >  wrote:
> > > > >
> > > > > > Hi Jun,
> > > > > >
> > > > > > 30.  Clarified.
> > > > > >
> > > > > > 31. I plan to do some benchmarking once implementation is
> finished,
> > > > I'll
> > > > > > update the KIP with the results once I have them.  The reason to
> > make
> > > > it
> > > > > > default is that it won't be used otherwise and we won't know if
> > it's
> > > > good
> > > > > > or not in practical workloads.
> > > > > >
> > > > > > -Artem
> > > > > >
> > > > > > On Thu, Mar 10, 2022 at 11:42 AM Jun Rao
>  > >
> > > > > wrote:
> > > > > >
> > > > > > > Hi, Artem,
> > > > > > >
> > > > > > > Thanks for the updated KIP. A couple of more comments.
> > > > > > >
> > > > > > > 30. For the 3 new configs, it would be useful to make it clear
> > that
> > > > > they
> > > > > > > are only relevant when the partitioner class is null.
> > > > > > >
> > > > > > > 31. partitioner.adaptive.partitioning.enable : I am wondering
> > > whether
> > > > > it
> > > > > > > should default to true. This is a more complex behavior than
> > > "uniform
> > > > > > > sticky" and may take some time to get right. If we do want to
> > > enable
> > > > it
> > > > > > by
> > > > > > > default, it would be useful to validate it with some test
> > results.
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Mar 9, 2022 at 10:05 PM Artem Livshits
> > > > > > >  wrote:
> > > > > > >
> > > > > > > > Thank you for feedback, I've discussed this offline with some
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

2022-03-14 Thread Artem Livshits
Hi Jun,

33.  That's an interesting point.  Technically, onNewBatch is just a way to
pass some signal to the partitioner; the sticky partitioner uses this signal
in a way that is suboptimal, and in theory someone could use it for something else.
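
To make the "signal" concrete, here is an illustrative partitioner (a sketch, not
the shipped implementation, and limited to a single topic) that only switches its
cached partition when onNewBatch fires:

import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

class OnNewBatchSignalPartitioner implements Partitioner {
    private final AtomicInteger stickyPartition = new AtomicInteger(-1);

    @Override
    public void configure(Map<String, ?> configs) {}

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int current = stickyPartition.get();
        if (current < 0) {
            current = pickRandom(topic, cluster);
            stickyPartition.compareAndSet(-1, current);
        }
        return current;
    }

    @Override
    public void onNewBatch(String topic, Cluster cluster, int prevPartition) {
        // The producer's "signal": a batch for prevPartition was closed,
        // so switch the sticky partition for subsequent records.
        stickyPartition.set(pickRandom(topic, cluster));
    }

    private int pickRandom(String topic, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        return ThreadLocalRandom.current().nextInt(numPartitions);
    }

    @Override
    public void close() {}
}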

-Artem

On Mon, Mar 14, 2022 at 9:11 AM Jun Rao  wrote:

> Hi, Artem,
>
> Thanks for the reply. A couple of more comments.
>
> 32. We could defer the metrics until we know what to add.
>
> 33. Since we are deprecating DefaultPartitioner and
> UniformStickyPartitioner, should we depreciate Partitioner.onNewBatch() too
> given its unexpected side effect?
>
> Thanks,
>
> Jun
>
> On Thu, Mar 10, 2022 at 5:20 PM Artem Livshits
>  wrote:
>
> > Hi Jun,
> >
> > 32. Good point.  Do you think it's ok to defer the metrics until we run
> > some benchmarks so that we get a better idea of what metrics we need?
> >
> > -Artem
> >
> > On Thu, Mar 10, 2022 at 3:12 PM Jun Rao 
> wrote:
> >
> > > Hi, Artem.
> > >
> > > Thanks for the reply. One more comment.
> > >
> > > 32. Do we need to add any new metric on the producer? For example, if
> > > partitioner.availability.timeout.ms is > 0, it might be useful to know
> > the
> > > number of unavailable partitions.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Mar 10, 2022 at 12:46 PM Artem Livshits
> > >  wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > 30.  Clarified.
> > > >
> > > > 31. I plan to do some benchmarking once implementation is finished,
> > I'll
> > > > update the KIP with the results once I have them.  The reason to make
> > it
> > > > default is that it won't be used otherwise and we won't know if it's
> > good
> > > > or not in practical workloads.
> > > >
> > > > -Artem
> > > >
> > > > On Thu, Mar 10, 2022 at 11:42 AM Jun Rao 
> > > wrote:
> > > >
> > > > > Hi, Artem,
> > > > >
> > > > > Thanks for the updated KIP. A couple of more comments.
> > > > >
> > > > > 30. For the 3 new configs, it would be useful to make it clear that
> > > they
> > > > > are only relevant when the partitioner class is null.
> > > > >
> > > > > 31. partitioner.adaptive.partitioning.enable : I am wondering
> whether
> > > it
> > > > > should default to true. This is a more complex behavior than
> "uniform
> > > > > sticky" and may take some time to get right. If we do want to
> enable
> > it
> > > > by
> > > > > default, it would be useful to validate it with some test results.
> > > > >
> > > > > Jun
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Mar 9, 2022 at 10:05 PM Artem Livshits
> > > > >  wrote:
> > > > >
> > > > > > Thank you for feedback, I've discussed this offline with some of
> > the
> > > > > folks
> > > > > > and updated the KIP.  The main change is that now instead of
> using
> > > > > > DefaultPartitioner and UniformStickyPartitioners as flags, in the
> > new
> > > > > > proposal the default partitioner is null, so if no custom
> > partitioner
> > > > is
> > > > > > specified then the partitioning logic is implemented in
> > > KafkaProducer.
> > > > > > Compatibility section is updated as well.  Also the configuration
> > > > options
> > > > > > are renamed to be more consistent.
> > > > > >
> > > > > > -Artem
> > > > > >
> > > > > > On Fri, Mar 4, 2022 at 10:38 PM Luke Chen 
> > wrote:
> > > > > >
> > > > > > > Hi Artem,
> > > > > > >
> > > > > > > Thanks for your explanation and update to the KIP.
> > > > > > > Some comments:
> > > > > > >
> > > > > > > 5. In the description for `enable.adaptive.partitioning`, the
> > > `false`
> > > > > > case,
> > > > > > > you said:
> > > > > > > > the producer will try to distribute messages uniformly.
> > > > > > > I think we should describe the possible skewing distribution.
> > > > > Ot

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

2022-03-10 Thread Artem Livshits
Hi Jun,

32. Good point.  Do you think it's ok to defer the metrics until we run
some benchmarks so that we get a better idea of what metrics we need?

-Artem

On Thu, Mar 10, 2022 at 3:12 PM Jun Rao  wrote:

> Hi, Artem.
>
> Thanks for the reply. One more comment.
>
> 32. Do we need to add any new metric on the producer? For example, if
> partitioner.availability.timeout.ms is > 0, it might be useful to know the
> number of unavailable partitions.
>
> Thanks,
>
> Jun
>
> On Thu, Mar 10, 2022 at 12:46 PM Artem Livshits
>  wrote:
>
> > Hi Jun,
> >
> > 30.  Clarified.
> >
> > 31. I plan to do some benchmarking once implementation is finished, I'll
> > update the KIP with the results once I have them.  The reason to make it
> > default is that it won't be used otherwise and we won't know if it's good
> > or not in practical workloads.
> >
> > -Artem
> >
> > On Thu, Mar 10, 2022 at 11:42 AM Jun Rao 
> wrote:
> >
> > > Hi, Artem,
> > >
> > > Thanks for the updated KIP. A couple of more comments.
> > >
> > > 30. For the 3 new configs, it would be useful to make it clear that
> they
> > > are only relevant when the partitioner class is null.
> > >
> > > 31. partitioner.adaptive.partitioning.enable : I am wondering whether
> it
> > > should default to true. This is a more complex behavior than "uniform
> > > sticky" and may take some time to get right. If we do want to enable it
> > by
> > > default, it would be useful to validate it with some test results.
> > >
> > > Jun
> > >
> > >
> > >
> > >
> > > On Wed, Mar 9, 2022 at 10:05 PM Artem Livshits
> > >  wrote:
> > >
> > > > Thank you for feedback, I've discussed this offline with some of the
> > > folks
> > > > and updated the KIP.  The main change is that now instead of using
> > > > DefaultPartitioner and UniformStickyPartitioners as flags, in the new
> > > > proposal the default partitioner is null, so if no custom partitioner
> > is
> > > > specified then the partitioning logic is implemented in
> KafkaProducer.
> > > > Compatibility section is updated as well.  Also the configuration
> > options
> > > > are renamed to be more consistent.
> > > >
> > > > -Artem
> > > >
> > > > On Fri, Mar 4, 2022 at 10:38 PM Luke Chen  wrote:
> > > >
> > > > > Hi Artem,
> > > > >
> > > > > Thanks for your explanation and update to the KIP.
> > > > > Some comments:
> > > > >
> > > > > 5. In the description for `enable.adaptive.partitioning`, the
> `false`
> > > > case,
> > > > > you said:
> > > > > > the producer will try to distribute messages uniformly.
> > > > > I think we should describe the possible skewing distribution.
> > > Otherwise,
> > > > > user might be confused about why adaptive partitioning is
> important.
> > > > >
> > > > > 6. In the description for `partition.availability.timeout.ms`, I
> > think
> > > > we
> > > > > should mention in the last sentence about if
> > > > `enable.adaptive.partitioning`
> > > > > is disabled this logic is also disabled.
> > > > >
> > > > > 7. Similar thoughts as Ismael, I think we should have a POC and
> test
> > to
> > > > > prove that this adaptive partitioning algorithm can have better
> > uniform
> > > > > partitioning, compared with original sticky one.
> > > > >
> > > > > Thank you.
> > > > > Luke
> > > > >
> > > > > On Fri, Mar 4, 2022 at 9:22 PM Ismael Juma 
> > wrote:
> > > > >
> > > > > > Regarding `3`, we should only deprecate it if we're sure the new
> > > > approach
> > > > > > handles all cases better. Are we confident about that for both of
> > the
> > > > > > previous partitioners?
> > > > > >
> > > > > > Ismael
> > > > > >
> > > > > > On Fri, Mar 4, 2022 at 1:37 AM David Jacot
> > >  > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Artem,
> > > > > > >
> > > > > > > Thanks for the KIP! I have a few comments:
> >

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

2022-03-10 Thread Artem Livshits
Hi Jun,

30.  Clarified.

31. I plan to do some benchmarking once implementation is finished, I'll
update the KIP with the results once I have them.  The reason to make it
default is that it won't be used otherwise and we won't know if it's good
or not in practical workloads.

-Artem

On Thu, Mar 10, 2022 at 11:42 AM Jun Rao  wrote:

> Hi, Artem,
>
> Thanks for the updated KIP. A couple of more comments.
>
> 30. For the 3 new configs, it would be useful to make it clear that they
> are only relevant when the partitioner class is null.
>
> 31. partitioner.adaptive.partitioning.enable : I am wondering whether it
> should default to true. This is a more complex behavior than "uniform
> sticky" and may take some time to get right. If we do want to enable it by
> default, it would be useful to validate it with some test results.
>
> Jun
>
>
>
>
> On Wed, Mar 9, 2022 at 10:05 PM Artem Livshits
>  wrote:
>
> > Thank you for feedback, I've discussed this offline with some of the
> folks
> > and updated the KIP.  The main change is that now instead of using
> > DefaultPartitioner and UniformStickyPartitioners as flags, in the new
> > proposal the default partitioner is null, so if no custom partitioner is
> > specified then the partitioning logic is implemented in KafkaProducer.
> > Compatibility section is updated as well.  Also the configuration options
> > are renamed to be more consistent.
> >
> > -Artem
> >
> > On Fri, Mar 4, 2022 at 10:38 PM Luke Chen  wrote:
> >
> > > Hi Artem,
> > >
> > > Thanks for your explanation and update to the KIP.
> > > Some comments:
> > >
> > > 5. In the description for `enable.adaptive.partitioning`, the `false`
> > case,
> > > you said:
> > > > the producer will try to distribute messages uniformly.
> > > I think we should describe the possible skewing distribution.
> Otherwise,
> > > user might be confused about why adaptive partitioning is important.
> > >
> > > 6. In the description for `partition.availability.timeout.ms`, I think
> > we
> > > should mention in the last sentence about if
> > `enable.adaptive.partitioning`
> > > is disabled this logic is also disabled.
> > >
> > > 7. Similar thoughts as Ismael, I think we should have a POC and test to
> > > prove that this adaptive partitioning algorithm can have better uniform
> > > partitioning, compared with original sticky one.
> > >
> > > Thank you.
> > > Luke
> > >
> > > On Fri, Mar 4, 2022 at 9:22 PM Ismael Juma  wrote:
> > >
> > > > Regarding `3`, we should only deprecate it if we're sure the new
> > approach
> > > > handles all cases better. Are we confident about that for both of the
> > > > previous partitioners?
> > > >
> > > > Ismael
> > > >
> > > > On Fri, Mar 4, 2022 at 1:37 AM David Jacot
>  > >
> > > > wrote:
> > > >
> > > > > Hi Artem,
> > > > >
> > > > > Thanks for the KIP! I have a few comments:
> > > > >
> > > > > 1. In the preamble of the proposed change section, there is still a
> > > > > mention of the
> > > > > -1 approach. My understanding is that we have moved away from it
> now.
> > > > >
> > > > > 2. I am a bit concerned by the trick suggested about the
> > > > > DefaultPartitioner and
> > > > > the UniformStickyPartitioner. I do agree that implementing the
> logic
> > in
> > > > the
> > > > > producer itself is a good thing. However, it is weird from a user
> > > > > perspective
> > > > > that he can set a class as partitioner that is not used in the
> end. I
> > > > > think that
> > > > > this will be confusing for our users. Have we considered changing
> the
> > > > > default
> > > > > value of partitioner.class to null to indicate that the new
> built-in
> > > > > partitioner
> > > > > must be used? By default, the built-in partitioner would be used
> > unless
> > > > the
> > > > > user explicitly specify one. The downside is that the new default
> > > > behavior
> > > > > would not work if the user explicitly specify the partitioner but
> we
> > > > could
> > > > > mitigate this with my next point.
> > > > >
> > > > > 3. Related to the pre

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

2022-03-09 Thread Artem Livshits
Thank you for feedback, I've discussed this offline with some of the folks
and updated the KIP.  The main change is that now instead of using
DefaultPartitioner and UniformStickyPartitioner as flags, in the new
proposal the default partitioner is null, so if no custom partitioner is
specified then the partitioning logic is implemented in KafkaProducer.
Compatibility section is updated as well.  Also the configuration options
are renamed to be more consistent.
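
From a user's point of view, the compatibility story looks roughly like this (a
hedged sketch; only the existing producer config constants are assumed):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BuiltInPartitioningExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // partitioner.class is deliberately left unset: with the proposed default
        // of null, the partitioning logic built into KafkaProducer is used.
        // Setting a custom partitioner class here would bypass that logic.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", null, "unkeyed value"));
        }
    }
}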

-Artem

On Fri, Mar 4, 2022 at 10:38 PM Luke Chen  wrote:

> Hi Artem,
>
> Thanks for your explanation and update to the KIP.
> Some comments:
>
> 5. In the description for `enable.adaptive.partitioning`, the `false` case,
> you said:
> > the producer will try to distribute messages uniformly.
> I think we should describe the possible skewing distribution. Otherwise,
> user might be confused about why adaptive partitioning is important.
>
> 6. In the description for `partition.availability.timeout.ms`, I think we
> should mention in the last sentence about if `enable.adaptive.partitioning`
> is disabled this logic is also disabled.
>
> 7. Similar thoughts as Ismael, I think we should have a POC and test to
> prove that this adaptive partitioning algorithm can have better uniform
> partitioning, compared with original sticky one.
>
> Thank you.
> Luke
>
> On Fri, Mar 4, 2022 at 9:22 PM Ismael Juma  wrote:
>
> > Regarding `3`, we should only deprecate it if we're sure the new approach
> > handles all cases better. Are we confident about that for both of the
> > previous partitioners?
> >
> > Ismael
> >
> > On Fri, Mar 4, 2022 at 1:37 AM David Jacot 
> > wrote:
> >
> > > Hi Artem,
> > >
> > > Thanks for the KIP! I have a few comments:
> > >
> > > 1. In the preamble of the proposed change section, there is still a
> > > mention of the
> > > -1 approach. My understanding is that we have moved away from it now.
> > >
> > > 2. I am a bit concerned by the trick suggested about the
> > > DefaultPartitioner and
> > > the UniformStickyPartitioner. I do agree that implementing the logic in
> > the
> > > producer itself is a good thing. However, it is weird from a user
> > > perspective
> > > that he can set a class as partitioner that is not used in the end. I
> > > think that
> > > this will be confusing for our users. Have we considered changing the
> > > default
> > > value of partitioner.class to null to indicate that the new built-in
> > > partitioner
> > > must be used? By default, the built-in partitioner would be used unless
> > the
> > > user explicitly specify one. The downside is that the new default
> > behavior
> > > would not work if the user explicitly specify the partitioner but we
> > could
> > > mitigate this with my next point.
> > >
> > > 3. Related to the previous point, I think that we could deprecate both
> > the
> > > DefaultPartitioner and the UniformStickyPartitioner. I would also add a
> > > warning if one of them is explicitly provided by the user to inform
> them
> > > that they should switch to the new built-in one. I am pretty sure that
> > most
> > > of the folks use the default configuration anyway.
> > >
> > > 4. It would be great if we could explain why the -1 way was rejected.
> At
> > > the moment, the rejected alternative only explain the idea but does not
> > > say why we rejected it.
> > >
> > > Best,
> > > David
> > >
> > > On Fri, Mar 4, 2022 at 6:03 AM Artem Livshits
> > >  wrote:
> > > >
> > > > Hi Jun,
> > > >
> > > > 2. Removed the option from the KIP.  Now the sticky partitioning
> > > threshold
> > > > is hardcoded to batch.size.
> > > >
> > > > 20. Added the corresponding wording to the KIP.
> > > >
> > > > -Artem
> > > >
> > > > On Thu, Mar 3, 2022 at 10:52 AM Jun Rao 
> > > wrote:
> > > >
> > > > > Hi, Artem,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > 1. Sounds good.
> > > > >
> > > > > 2. If we don't expect users to change it, we probably could just
> > leave
> > > out
> > > > > the new config. In general, it's easy to add a new config, but hard
> > to
> > > > > remove an existing config.
> > > > >
> > > > > 20. The two new configs

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

2022-03-03 Thread Artem Livshits
Hi Jun,

2. Removed the option from the KIP.  Now the sticky partitioning threshold
is hardcoded to batch.size.

20. Added the corresponding wording to the KIP.

-Artem

On Thu, Mar 3, 2022 at 10:52 AM Jun Rao  wrote:

> Hi, Artem,
>
> Thanks for the reply.
>
> 1. Sounds good.
>
> 2. If we don't expect users to change it, we probably could just leave out
> the new config. In general, it's easy to add a new config, but hard to
> remove an existing config.
>
> 20. The two new configs enable.adaptive.partitioning and
> partition.availability.timeout.ms only apply to the two built-in
> partitioners DefaultPartitioner and UniformStickyPartitioner, right? It
> would be useful to document that in the KIP.
>
> Thanks,
>
> Jun
>
> On Thu, Mar 3, 2022 at 9:47 AM Artem Livshits
>  wrote:
>
> > Hi Jun,
> >
> > Thank you for the suggestions.
> >
> > 1. As we discussed offline, we can hardcode the logic for
> > DefaultPartitioner and UniformStickyPartitioner in the KafkaProducer
> (i.e.
> > the DefaultPartitioner.partition won't get called, instead KafkaProducer
> > would check if the partitioner is an instance of DefaultPartitioner and
> > then run the actual partitioning logic itself).  Then the change to the
> > Partitioner wouldn't be required.  I'll update the KIP to reflect that.
> >
> > 2. I don't expect users to change this too often, as changing it would
> > require a bit of studying of the production patterns.  As a general
> > principle, if I can think of a model that requires a deviation from
> > default, I tend to add a configuration option.  It could be that it'll
> > never get used in practice, but I cannot prove that.  I'm ok with
> removing
> > the option, let me know what you think.
> >
> > -Artem
> >
> > On Mon, Feb 28, 2022 at 2:06 PM Jun Rao 
> wrote:
> >
> > > Hi, Artem,
> > >
> > > Thanks for the reply. A few more comments.
> > >
> > > 1. Since we control the implementation and the usage of
> > DefaultPartitioner,
> > > another way is to instantiate the DefaultPartitioner with a special
> > > constructor, which allows it to have more access to internal
> information.
> > > Then we could just change the behavior of  DefaultPartitioner such that
> > it
> > > can use the internal information when choosing the partition. This seems
> > > more intuitive than having DefaultPartitioner return -1 partition.
> > >
> > > 2. I guess partitioner.sticky.batch.size is introduced because the
> > > effective batch size could be less than batch.size and we want to align
> > > partition switching with the effective batch size. How would a user
> know
> > > the effective batch size to set partitioner.sticky.batch.size properly?
> > If
> > > the user somehow knows the effective batch size, does setting
> batch.size
> > to
> > > the effective batch size achieve the same result?
> > >
> > > 4. Thanks for the explanation. Makes sense to me.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Fri, Feb 25, 2022 at 8:26 PM Artem Livshits
> > >  wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > 1. Updated the KIP to add a couple paragraphs about implementation
> > > > necessities in the Proposed Changes section.
> > > >
> > > > 2. Sorry if my reply was confusing, what I meant to say (and I
> > elaborated
> > > > on that in point #3) is that there could be patterns for which 16KB
> > > > wouldn't be the most effective setting, thus it would be good to make
> > it
> > > > configurable.
> > > >
> > > > 4. We could use broker readiness timeout.  But I'm not sure it would
> > > > correctly model the broker load.  The problem is that latency is not
> an
> > > > accurate measure of throughput, we could have 2 brokers that have
> equal
> > > > throughput but one has higher latency (so it takes larger batches
> less
> > > > frequently, but still takes the same load).  Latency-based logic is
> > > likely
> > > > to send less data to the broker with higher latency.  Using the queue
> > > size
> > > > would adapt to throughput, regardless of latency (which could be
> just a
> > > > result of deployment topology), so that's the model chosen in the
> > > > proposal.  The partition.availability.timeout.ms logic approaches
> the
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

2022-03-03 Thread Artem Livshits
Hi Jun,

Thank you for the suggestions.

1. As we discussed offline, we can hardcode the logic for
DefaultPartitioner and UniformStickyPartitioner in the KafkaProducer (i.e.
the DefaultPartitioner.partition won't get called; instead, KafkaProducer
would check if the partitioner is an instance of DefaultPartitioner and
then run the actual partitioning logic itself).  Then the change to the
Partitioner wouldn't be required.  I'll update the KIP to reflect that.

2. I don't expect users to change this too often, as changing it would
require a bit of studying of the production patterns.  As a general
principle, if I can think of a model that requires a deviation from
default, I tend to add a configuration option.  It could be that it'll
never get used in practice, but I cannot prove that.  I'm ok with removing
the option; let me know what you think.
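
A rough sketch of the point-1 idea above (the helper names are invented for
illustration; this is not the actual KafkaProducer source):

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.clients.producer.internals.DefaultPartitioner;
import org.apache.kafka.common.Cluster;

class PartitionResolverSketch {
    // Hypothetical stand-in for the built-in KIP-794 partitioning logic.
    private int builtInPartition(String topic, byte[] keyBytes, Cluster cluster) {
        return 0;
    }

    int resolve(String topic, Object key, byte[] keyBytes,
                Object value, byte[] valueBytes,
                Partitioner partitioner, Cluster cluster) {
        // If no partitioner is configured, or the built-in DefaultPartitioner is,
        // the producer runs its own logic and never calls partitioner.partition().
        if (partitioner == null || partitioner instanceof DefaultPartitioner) {
            return builtInPartition(topic, keyBytes, cluster);
        }
        return partitioner.partition(topic, key, keyBytes, value, valueBytes, cluster);
    }
}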

-Artem

On Mon, Feb 28, 2022 at 2:06 PM Jun Rao  wrote:

> Hi, Artem,
>
> Thanks for the reply. A few more comments.
>
> 1. Since we control the implementation and the usage of DefaultPartitioner,
> another way is to instantiate the DefaultPartitioner with a special
> constructor, which allows it to have more access to internal information.
> Then we could just change the behavior of  DefaultPartitioner such that it
> can use the internal information when choosing the partition. This seems
> more intuitive than having DefaultPartitioner return -1 partition.
>
> 2. I guess partitioner.sticky.batch.size is introduced because the
> effective batch size could be less than batch.size and we want to align
> partition switching with the effective batch size. How would a user know
> the effective batch size to set partitioner.sticky.batch.size properly? If
> the user somehow knows the effective batch size, does setting batch.size to
> the effective batch size achieve the same result?
>
> 4. Thanks for the explanation. Makes sense to me.
>
> Thanks,
>
> Jun
>
> Thanks,
>
> Jun
>
> On Fri, Feb 25, 2022 at 8:26 PM Artem Livshits
>  wrote:
>
> > Hi Jun,
> >
> > 1. Updated the KIP to add a couple paragraphs about implementation
> > necessities in the Proposed Changes section.
> >
> > 2. Sorry if my reply was confusing, what I meant to say (and I elaborated
> > on that in point #3) is that there could be patterns for which 16KB
> > wouldn't be the most effective setting, thus it would be good to make it
> > configurable.
> >
> > 4. We could use broker readiness timeout.  But I'm not sure it would
> > correctly model the broker load.  The problem is that latency is not an
> > accurate measure of throughput, we could have 2 brokers that have equal
> > throughput but one has higher latency (so it takes larger batches less
> > frequently, but still takes the same load).  Latency-based logic is
> likely
> > to send less data to the broker with higher latency.  Using the queue
> size
> > would adapt to throughput, regardless of latency (which could be just a
> > result of deployment topology), so that's the model chosen in the
> > proposal.  The partition.availability.timeout.ms logic approaches the
> > model
> > from a slightly different angle, say we have a requirement to deliver
> > messages via brokers that have a certain latency, then
> > partition.availability.timeout.ms could be used to tune that.  Latency
> is
> > a
> > much more volatile metric than throughput (latency depends on external
> > load, on capacity, on deployment topology, on jitter in network, on
> jitter
> > in disk, etc.) and I think it would be best to leave latency-based
> > thresholds configurable to tune for the environment.
> >
> > -Artem
> >
> > On Wed, Feb 23, 2022 at 11:14 AM Jun Rao 
> wrote:
> >
> > > Hi, Artem,
> > >
> > > Thanks for the reply. A few more comments.
> > >
> > > 1. Perhaps you could elaborate a bit more on how the producer
> determines
> > > the partition if the partitioner returns -1. This will help understand
> > why
> > > encapsulating that logic as a partitioner is not clean.
> > >
> > > 2. I am not sure that I understand this part. If 15.5KB is more
> > efficient,
> > > could we just set batch.size to 15.5KB?
> > >
> > > 4. Yes, we could add a switch (or a variant of the partitioner) for
> > > enabling this behavior. Also, choosing partitions based on broker
> > readiness
> > > can be made in a smoother way. For example, we could track the last
> time
> > a
> > > broker has drained any batches from the accumulator. We can then select
> > > partitions from brokers proportionally to the inverse of that time.

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

2022-02-25 Thread Artem Livshits
Hi Jun,

1. Updated the KIP to add a couple paragraphs about implementation
necessities in the Proposed Changes section.

2. Sorry if my reply was confusing, what I meant to say (and I elaborated
on that in point #3) is that there could be patterns for which 16KB
wouldn't be the most effective setting, thus it would be good to make it
configurable.

4. We could use broker readiness timeout.  But I'm not sure it would
correctly model the broker load.  The problem is that latency is not an
accurate measure of throughput; we could have 2 brokers that have equal
throughput but one has higher latency (so it takes larger batches less
frequently, but still takes the same load).  Latency-based logic is likely
to send less data to the broker with higher latency.  Using the queue size
would adapt to throughput, regardless of latency (which could be just a
result of deployment topology), so that's the model chosen in the
proposal.  The partition.availability.timeout.ms logic approaches the model
from a slightly different angle: say we have a requirement to deliver
messages via brokers that have a certain latency; then
partition.availability.timeout.ms could be used to tune that.  Latency is a
much more volatile metric than throughput (latency depends on external
load, on capacity, on deployment topology, on jitter in network, on jitter
in disk, etc.) and I think it would be best to leave latency-based
thresholds configurable to tune for the environment.
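
For illustration of the queue-size model (an independent sketch, not the actual
accumulator code), partitions can be chosen with probability proportional to the
inverse of their current queue size, so brokers draining quickly receive more data
regardless of their latency:

import java.util.concurrent.ThreadLocalRandom;

class QueueSizeWeightedChooser {
    /**
     * @param queueSizes number of batches currently queued per partition
     *                   (must be non-empty)
     * @return a partition index chosen with probability proportional to
     *         1 / (queueSize + 1)
     */
    int choose(int[] queueSizes) {
        double[] weights = new double[queueSizes.length];
        double total = 0.0;
        for (int i = 0; i < queueSizes.length; i++) {
            weights[i] = 1.0 / (queueSizes[i] + 1.0); // +1 avoids division by zero
            total += weights[i];
        }
        double r = ThreadLocalRandom.current().nextDouble(total);
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r <= 0) {
                return i;
            }
        }
        return queueSizes.length - 1; // guard against floating-point rounding
    }
}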

-Artem

On Wed, Feb 23, 2022 at 11:14 AM Jun Rao  wrote:

> Hi, Artem,
>
> Thanks for the reply. A few more comments.
>
> 1. Perhaps you could elaborate a bit more on how the producer determines
> the partition if the partitioner returns -1. This will help understand why
> encapsulating that logic as a partitioner is not clean.
>
> 2. I am not sure that I understand this part. If 15.5KB is more efficient,
> could we just set batch.size to 15.5KB?
>
> 4. Yes, we could add a switch (or a variant of the partitioner) for
> enabling this behavior. Also, choosing partitions based on broker readiness
> can be made in a smoother way. For example, we could track the last time a
> broker has drained any batches from the accumulator. We can then select
> partitions from brokers proportionally to the inverse of that time. This
> seems smoother than a cutoff based on a partition.availability.timeout.ms
>  threshold.
>
> Thanks,
>
> Jun
>
> On Fri, Feb 18, 2022 at 5:14 PM Artem Livshits
>  wrote:
>
> > Hello Luke, Jun,
> >
> > Thank you for your feedback.  I've added the Rejected Alternative section
> > that may clarify some of the questions w.r.t. returning -1.
> >
> > 1. I've elaborated on the -1 in the KIP.  The problem is that a
> significant
> > part of the logic needs to be in the producer (because it now uses
> > information about brokers that only the producer knows), so encapsulation
> > of the logic within the default partitioner isn't as clean.   I've added
> > the Rejected Alternative section that documents an attempt to keep the
> > encapsulation by providing new callbacks to the partitioner.
> >
> > 2. The meaning of the partitioner.sticky.batch.size is explained in the
> > Uniform Sticky Batch Size section.  Basically, we track the amount of
> bytes
> > produced to the partition and if it exceeds partitioner.sticky.batch.size
> > then we switch to the next partition.  As far as the reason to make it
> > different from batch.size, I think Luke answered this with the question
> #3
> > -- what if the load pattern is such that 15.5KB would be more efficient
> > than 16KB?
> >
> > 3. I think it's hard to have one size that would fit all patterns.  E.g.
> if
> > the load pattern is such that there is linger and the app fills the batch
> > before linger expires, then having 16KB would most likely synchronize
> > batching and partition switching, so each partition would get a full
> > batch.  If load pattern is such that there are a few non-complete batches
> > go out before a larger batch starts to fill, then it may actually be
> > beneficial to make slightly larger (e.g. linger=0, first few records go
> in
> > the first batch, then next few records go to second batch, and so on,
> until
> > 5 in-flight, then larger batch would form while waiting for broker to
> > respond, but the partition switch would happen before the larger batch is
> > full).
> >
> > 4. There are a couple of reasons for introducing
> > partition.availability.timeout.ms.  Luke's an Jun's questions are
> slightly
> > different, so I'm going to separate replies.
> > (Luke) Is the queue size a good enough signal?  I think it's a good
> default
> > signal as it tries to preserve gen

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

2022-02-18 Thread Artem Livshits
r.class defaults to DefaultPartitioner, which uses
> StickyPartitioner when the key is specified. Since this KIP improves upon
> StickyPartitioner, it would be useful to make the new behavior the default
> and document that in the KIP.
>
> Thanks,
>
> Jun
>
>
> On Wed, Feb 16, 2022 at 7:30 PM Luke Chen  wrote:
>
> > Hi Artem,
> >
> > Also, one more thing I think you need to know.
> > As this bug KAFKA-7572 <https://issues.apache.org/jira/browse/KAFKA-7572>
> > mentioned, sometimes a custom partitioner would return a negative partition
> > id accidentally.
> > If it returned -1, how could you know whether it is expected or not?
> >
> > Thanks.
> > Luke
> >
> > On Wed, Feb 16, 2022 at 3:28 PM Luke Chen  wrote:
> >
> > > Hi Artem,
> > >
> > > Thanks for the update. I have some questions about it:
> > >
> > > 1. Could you explain why you need the `partitioner` to return -1? In which
> > > case do we need it? And how is it used in your KIP?
> > > 2. What does "partitioner.sticky.batch.size" mean? You didn't explain it in
> > > the "Configuration" part. And it defaults to 0 -- I guess that's the same
> > > as the current behavior, for backward compatibility, right? You should
> > > mention it.
> > > 3. I'm thinking we could have a threshold for
> > > "partitioner.sticky.batch.size". Let's say we have already accumulated
> > > 15.5KB in partition1 and sent it. In your current design, when the next
> > > batch is created we still stick to partition1 until 16KB is reached, and
> > > only then create a new batch and change to the next partition, e.g.
> > > partition2. But if we set a threshold of 95% (for example), we would know
> > > the previous 15.5KB already exceeds the threshold, so we could directly
> > > create the new batch for the next records. That should be more efficient.
> > > WDYT?
> > > 4. I think the improved queuing logic should be good enough. I can't see
> > > the benefit of having the `partition.availability.timeout.ms` config. In
> > > short, you want to make the partitioner take the broker load into
> > > consideration. We can just improve that in the queuing logic (and you
> > > already did it). Why should we add the config? Could you use some examples
> > > to explain why we need it?
> > >
> > > Thank you.
> > > Luke
> > >
> > > On Wed, Feb 16, 2022 at 8:57 AM Artem Livshits
> > >  wrote:
> > >
> > >> Hello,
> > >>
> > >> Please add your comments about the KIP.  If there are no
> considerations,
> > >> I'll put it up for vote in the next few days.
> > >>
> > >> -Artem
> > >>
> > >> On Mon, Feb 7, 2022 at 6:01 PM Artem Livshits  >
> > >> wrote:
> > >>
> > >> > Hello,
> > >> >
> > >> > After trying a few prototypes, I've made some changes to the public
> > >> > interface.  Please see the updated document
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > >> > .
> > >> >
> > >> > -Artem
> > >> >
> > >> > On Thu, Nov 4, 2021 at 10:37 AM Artem Livshits <
> > alivsh...@confluent.io>
> > >> > wrote:
> > >> >
> > >> >> Hello,
> > >> >>
> > >> >> This is the discussion thread for
> > >> >>
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > >> >> .
> > >> >>
> > >> >> The proposal is a bug fix for
> > >> >> https://issues.apache.org/jira/browse/KAFKA-10888, but it does
> > >> include a
> > >> >> client config change, therefore we have a KIP to discuss.
> > >> >>
> > >> >> -Artem
> > >> >>
> > >> >
> > >>
> > >
> >
>


Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

2022-02-15 Thread Artem Livshits
Hello,

Please add your comments about the KIP.  If there are no considerations,
I'll put it up for vote in the next few days.

-Artem

On Mon, Feb 7, 2022 at 6:01 PM Artem Livshits 
wrote:

> Hello,
>
> After trying a few prototypes, I've made some changes to the public
> interface.  Please see the updated document
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> .
>
> -Artem
>
> On Thu, Nov 4, 2021 at 10:37 AM Artem Livshits 
> wrote:
>
>> Hello,
>>
>> This is the discussion thread for
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
>> .
>>
>> The proposal is a bug fix for
>> https://issues.apache.org/jira/browse/KAFKA-10888, but it does include a
>> client config change, therefore we have a KIP to discuss.
>>
>> -Artem
>>
>


Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

2022-02-07 Thread Artem Livshits
Hello,

After trying a few prototypes, I've made some changes to the public
interface.  Please see the updated document
https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
.

-Artem

On Thu, Nov 4, 2021 at 10:37 AM Artem Livshits 
wrote:

> Hello,
>
> This is the discussion thread for
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> .
>
> The proposal is a bug fix for
> https://issues.apache.org/jira/browse/KAFKA-10888, but it does include a
> client config change, therefore we have a KIP to discuss.
>
> -Artem
>


[jira] [Resolved] (KAFKA-13540) UniformStickyPartitioner leads to uneven Kafka partitions

2021-12-13 Thread Artem Livshits (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Livshits resolved KAFKA-13540.

Resolution: Duplicate

See also 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner

> UniformStickyPartitioner leads to uneven Kafka partitions
> -
>
> Key: KAFKA-13540
> URL: https://issues.apache.org/jira/browse/KAFKA-13540
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, producer 
>Affects Versions: 2.4.1
>Reporter: nk2242696
>Priority: Major
> Attachments: MicrosoftTeams-image (1).png
>
>
> Kafka topic with 20 partitions, 24 hour TTL, replication factor of 3.
> Using UniformStickyPartitioner we expected each partition to be roughly the
> same size, but the realised size for some of the partitions is almost double.
> !MicrosoftTeams-image (1).png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [DISCUSS] KIP-782: Expandable batch size in producer

2021-12-08 Thread Artem Livshits
Hi Jun,

11. That was my initial thinking as well, but in a discussion some people
pointed out the change of behavior in some scenarios.  E.g. someone for some
reason really wants batches to be at least 16KB and sets a large linger.ms;
most of the time the batches fill quickly enough and they observe a certain
latency.  Then they upgrade their client, the default becomes 256KB, and the
latency increases.  This could be seen as a regression.  It could be fixed by
just reducing linger.ms to specify the expected latency, but it could still be
seen as a disruption by some users.
The other reason to have 2 sizes is to avoid allocating large buffers
upfront.

-Artem

On Wed, Dec 8, 2021 at 3:07 PM Jun Rao  wrote:

> Hi, Artem,
>
> Thanks for the reply.
>
> 11. Got it. To me, batch.size is really used for throughput and not for
> latency guarantees. There is no guarantee when 16KB will be accumulated.
> So, if users want any latency guarantee, they will need to specify
> linger.ms accordingly.
> Then, batch.size can just be used to tune for throughput.
>
> 20. Could we also describe the unit of compression? Is
> it batch.initial.size, batch.size or batch.max.size?
>
> Thanks,
>
> Jun
>
> On Wed, Dec 8, 2021 at 9:58 AM Artem Livshits
>  wrote:
>
> > Hi Jun,
> >
> > 10. My understanding is that MemoryRecords would under the covers be
> > allocated in chunks, so logically it still would be one MemoryRecords
> > object, it's just instead of allocating one large chunk upfront, smaller
> > chunks are allocated as needed to grow the batch and linked into a list.
> >
> > 11. The reason for 2 sizes is to avoid change of behavior when triggering
> > batch send with large linger.ms.  Currently, a batch send is triggered
> > once
> > the batch reaches 16KB by default, if we just raise the default to 256KB,
> > then the batch send will be delayed.  Using a separate value would allow
> > keeping the current behavior when sending the batch out, but provide
> better
> > throughput with high latency + high bandwidth channels.
> >
> > -Artem
> >
> > On Tue, Dec 7, 2021 at 5:29 PM Jun Rao  wrote:
> >
> > > Hi, Luke,
> > >
> > > Thanks for the KIP.  A few comments below.
> > >
> > > 10. Accumulating small batches could improve memory usage. Will that
> > > introduce extra copying when generating a produce request? Currently, a
> > > produce request takes a single MemoryRecords per partition.
> > > 11. Do we need to introduce a new config batch.max.size? Could we just
> > > increase the default of batch.size? We probably need to have KIP-794
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > >
> > > resolved
> > > before increasing the default batch size since the larger the batch
> size,
> > > the worse the problem in KIP-794.
> > > 12. As for max.request.size, currently it's used for both the max
> record
> > > size and the max request size, which is unintuitive. Perhaps we could
> > > introduce a new config max.record.size that defaults to 1MB. We could
> > then
> > > increase max.request.size to sth like 10MB.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > > On Mon, Nov 29, 2021 at 6:02 PM Artem Livshits
> > >  wrote:
> > >
> > > > Hi Luke,
> > > >
> > > > I don't mind increasing the max.request.size to a higher number, e.g.
> > 2MB
> > > > could be good.  I think we should also run some benchmarks to see the
> > > > effects of different sizes.
> > > >
> > > > I agree that changing round robin to random solves an independent
> > > existing
> > > > issue, however the logic in this KIP exacerbates the issue, so there
> is
> > > > some dependency.
> > > >
> > > > -Artem
> > > >
> > > > On Wed, Nov 24, 2021 at 12:43 AM Luke Chen 
> wrote:
> > > >
> > > > > Hi Artem,
> > > > > Yes, I agree if we go with random selection instead of round-robin
> > > > > selection, the latency issue will be more fair. That is, if there
> are
> > > 10
> > > > > partitions, the 10th partition will always be the last choice in
> each
> > > > round
> > > > > in current design, but with random selection, the chance to be
> > selected
> > > > is
> > > > > more fair.
> > > > >
> >

Re: [DISCUSS] KIP-782: Expandable batch size in producer

2021-12-08 Thread Artem Livshits
Hi Jun,

10. My understanding is that MemoryRecords would under the covers be
allocated in chunks, so logically it still would be one MemoryRecords
object, it's just instead of allocating one large chunk upfront, smaller
chunks are allocated as needed to grow the batch and linked into a list.

11. The reason for 2 sizes is to avoid change of behavior when triggering
batch send with a large linger.ms.  Currently, a batch send is triggered once
the batch reaches 16KB by default; if we just raise the default to 256KB,
then the batch send will be delayed.  Using a separate value would allow
keeping the current behavior when sending the batch out, but provide better
throughput with high latency + high bandwidth channels.
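
As a rough illustration of the chunked allocation in point 10 (the idea of growing a logical buffer as a list of chunks instead of one big upfront allocation -- not the actual MemoryRecords code):

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;

    class ChunkedBuffer {
        private final int chunkSize;
        private final List<ByteBuffer> chunks = new ArrayList<>();

        ChunkedBuffer(int chunkSize) { this.chunkSize = chunkSize; }

        void append(byte[] record) {
            int offset = 0;
            while (offset < record.length) {
                ByteBuffer last = chunks.isEmpty() ? null : chunks.get(chunks.size() - 1);
                if (last == null || !last.hasRemaining()) {
                    last = ByteBuffer.allocate(chunkSize);  // allocate the next chunk lazily
                    chunks.add(last);
                }
                int n = Math.min(last.remaining(), record.length - offset);
                last.put(record, offset, n);
                offset += n;
            }
        }
    }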

-Artem

On Tue, Dec 7, 2021 at 5:29 PM Jun Rao  wrote:

> Hi, Luke,
>
> Thanks for the KIP.  A few comments below.
>
> 10. Accumulating small batches could improve memory usage. Will that
> introduce extra copying when generating a produce request? Currently, a
> produce request takes a single MemoryRecords per partition.
> 11. Do we need to introduce a new config batch.max.size? Could we just
> increase the default of batch.size? We probably need to have KIP-794
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> >
> resolved
> before increasing the default batch size since the larger the batch size,
> the worse the problem in KIP-794.
> 12. As for max.request.size, currently it's used for both the max record
> size and the max request size, which is unintuitive. Perhaps we could
> introduce a new config max.record.size that defaults to 1MB. We could then
> increase max.request.size to sth like 10MB.
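
Purely as an illustration of the split being discussed here (these names and values are proposals in this thread and KIP-782, not existing producer configs):

    # start each batch with a small allocation and grow it (KIP-782 proposal)
    batch.initial.size=4096
    # a batch is considered "ready to send" at this size (existing config)
    batch.size=16384
    # a batch may keep growing up to this size while waiting to be drained (KIP-782 proposal)
    batch.max.size=262144
    # suggested new cap for a single record, decoupled from the request cap
    max.record.size=1048576
    # could then be raised to bound the whole request
    max.request.size=10485760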
>
> Thanks,
>
> Jun
>
>
> On Mon, Nov 29, 2021 at 6:02 PM Artem Livshits
>  wrote:
>
> > Hi Luke,
> >
> > I don't mind increasing the max.request.size to a higher number, e.g. 2MB
> > could be good.  I think we should also run some benchmarks to see the
> > effects of different sizes.
> >
> > I agree that changing round robin to random solves an independent
> existing
> > issue, however the logic in this KIP exacerbates the issue, so there is
> > some dependency.
> >
> > -Artem
> >
> > On Wed, Nov 24, 2021 at 12:43 AM Luke Chen  wrote:
> >
> > > Hi Artem,
> > > Yes, I agree if we go with random selection instead of round-robin
> > > selection, the latency issue will be more fair. That is, if there are
> 10
> > > partitions, the 10th partition will always be the last choice in each
> > round
> > > in current design, but with random selection, the chance to be selected
> > is
> > > more fair.
> > >
> > > However, I think that's kind of out of scope with this KIP. This is an
> > > existing issue, and it might need further discussion to decide if this
> > > change is necessary.
> > >
> > > I agree the default 32KB for "batch.max.size" might be not huge
> > improvement
> > > compared with 256KB. I'm thinking, maybe default to "64KB" for
> > > "batch.max.size", and make the documentation clear that if the
> > > "batch.max.size"
> > > is increased, there might be chances that the "ready" partitions need
> to
> > > wait for next request to send to broker, because of the
> > "max.request.size"
> > > (default 1MB) limitation. "max.request.size" can also be considered to
> > > increase to avoid this issue. What do you think?
> > >
> > > Thank you.
> > > Luke
> > >
> > > On Wed, Nov 24, 2021 at 2:26 AM Artem Livshits
> > >  wrote:
> > >
> > > > >  maybe I can firstly decrease the "batch.max.size" to 32KB
> > > >
> > > > I think 32KB is too small.  With 5 in-flight and 100ms latency we can
> > > > produce 1.6MB/s per partition.  With 256KB we can produce 12.8MB/s
> per
> > > > partition.  We should probably set up some testing and see if 256KB
> has
> > > > problems.
> > > >
> > > > To illustrate latency dynamics, let's consider a simplified model: 1
> > > > in-flight request per broker, produce latency 125ms, 256KB max
> request
> > > > size, 16 partitions assigned to the same broker, every second 128KB
> is
> > > > produced to each partition (total production rate is 2MB/sec).
> > > >
> > > > If the batch size is 16KB, then the pattern would be the following:
> > > >
> > > > 0ms - produce 128KB into each partition
> > > > 0ms - take 16KB fr

Re: [DISCUSS] KIP-782: Expandable batch size in producer

2021-11-29 Thread Artem Livshits
Hi Luke,

I don't mind increasing the max.request.size to a higher number, e.g. 2MB
could be good.  I think we should also run some benchmarks to see the
effects of different sizes.

I agree that changing round robin to random solves an independent existing
issue, however the logic in this KIP exacerbates the issue, so there is
some dependency.

-Artem

On Wed, Nov 24, 2021 at 12:43 AM Luke Chen  wrote:

> Hi Artem,
> Yes, I agree if we go with random selection instead of round-robin
> selection, the latency issue will be more fair. That is, if there are 10
> partitions, the 10th partition will always be the last choice in each round
> in current design, but with random selection, the chance to be selected is
> more fair.
>
> However, I think that's kind of out of scope with this KIP. This is an
> existing issue, and it might need further discussion to decide if this
> change is necessary.
>
> I agree the default 32KB for "batch.max.size" might not be a huge improvement
> compared with 256KB. I'm thinking, maybe default to "64KB" for
> "batch.max.size", and make the documentation clear that if the
> "batch.max.size"
> is increased, there might be chances that the "ready" partitions need to
> wait for next request to send to broker, because of the "max.request.size"
> (default 1MB) limitation. "max.request.size" can also be considered to
> increase to avoid this issue. What do you think?
>
> Thank you.
> Luke
>
> On Wed, Nov 24, 2021 at 2:26 AM Artem Livshits
>  wrote:
>
> > >  maybe I can firstly decrease the "batch.max.size" to 32KB
> >
> > I think 32KB is too small.  With 5 in-flight and 100ms latency we can
> > produce 1.6MB/s per partition.  With 256KB we can produce 12.8MB/s per
> > partition.  We should probably set up some testing and see if 256KB has
> > problems.
> >
> > To illustrate latency dynamics, let's consider a simplified model: 1
> > in-flight request per broker, produce latency 125ms, 256KB max request
> > size, 16 partitions assigned to the same broker, every second 128KB is
> > produced to each partition (total production rate is 2MB/sec).
> >
> > If the batch size is 16KB, then the pattern would be the following:
> >
> > 0ms - produce 128KB into each partition
> > 0ms - take 16KB from each partition send (total 256KB)
> > 125ms - complete first 16KB from each partition, send next 16KB
> > 250ms - complete second 16KB, send next 16KB
> > ...
> > 1000ms - complete 8th 16KB from each partition
> >
> > from this model it's easy to see that there are 256KB that are sent
> > immediately, 256KB that are sent in 125ms, ... 256KB that are sent in
> > 875ms.
> >
> > If the batch size is 256KB, then the pattern would be the following:
> >
> > 0ms - produce 128KB into each partition
> > 0ms - take 128KB each from first 2 partitions and send (total 256KB)
> > 125ms - complete 2 first partitions, send data from next 2 partitions
> > ...
> > 1000ms - complete last 2 partitions
> >
> > even though the pattern is different, there are still 256KB that are sent
> > immediately, 256KB that are sent in 125ms, ... 256KB that are sent in
> > 875ms.
> >
> > Now, in this example if we do strictly round-robin (current
> implementation)
> > and we have this exact pattern (not sure how often such regular pattern
> > would happen in practice -- I would expect that it would be a bit more
> > random), some partitions would experience higher latency than others (not
> > sure how much it would matter in practice -- in the end of the day some
> > bytes produced to a topic would have higher latency and some bytes would
> > have lower latency).  This pattern is easily fixed by choosing the next
> > partition randomly instead of using round-robin.
> >
> > -Artem
> >
> > On Tue, Nov 23, 2021 at 12:08 AM Luke Chen  wrote:
> >
> > > Hi Tom,
> > > Thanks for your comments. And thanks for Artem's explanation.
> > > Below is my response:
> > >
> > > > Currently because buffers are allocated using batch.size it means we
> > can
> > > handle records that are that large (e.g. one big record per batch).
> > Doesn't
> > > the introduction of smaller buffer sizes (batch.initial.size) mean a
> > > corresponding decrease in the maximum record size that the producer can
> > > handle?
> > >
> > > Actually, the "batch.size" is only like a threshold to decide if the
> > batch
> > > is "ready to be sent". That is, even if you set the &qu

Re: [DISCUSS] KIP-782: Expandable batch size in producer

2021-11-23 Thread Artem Livshits
>  maybe I can firstly decrease the "batch.max.size" to 32KB

I think 32KB is too small.  With 5 in-flight and 100ms latency we can
produce 1.6MB/s per partition.  With 256KB we can produce 12.8MB/s per
partition.  We should probably set up some testing and see if 256KB has
problems.
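
(Those figures follow from a simple bound -- per-partition throughput is at most the number of in-flight requests times the batch size divided by the round-trip latency:)

    5 in-flight * 32 KB  / 0.1 s = 1.6 MB/s per partition
    5 in-flight * 256 KB / 0.1 s = 12.8 MB/s per partition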

To illustrate latency dynamics, let's consider a simplified model: 1
in-flight request per broker, produce latency 125ms, 256KB max request
size, 16 partitions assigned to the same broker, every second 128KB is
produced to each partition (total production rate is 2MB/sec).

If the batch size is 16KB, then the pattern would be the following:

0ms - produce 128KB into each partition
0ms - take 16KB from each partition and send (total 256KB)
125ms - complete first 16KB from each partition, send next 16KB
250ms - complete second 16KB, send next 16KB
...
1000ms - complete 8th 16KB from each partition

from this model it's easy to see that there are 256KB that are sent
immediately, 256KB that are sent in 125ms, ... 256KB that are sent in 875ms.

If the batch size is 256KB, then the pattern would be the following:

0ms - produce 128KB into each partition
0ms - take 128KB each from first 2 partitions and send (total 256KB)
125ms - complete the first 2 partitions, send data from the next 2 partitions
...
1000ms - complete last 2 partitions

even though the pattern is different, there are still 256KB that are sent
immediately, 256KB that are sent in 125ms, ... 256KB that are sent in 875ms.

Now, in this example if we do strictly round-robin (current implementation)
and we have this exact pattern (not sure how often such regular pattern
would happen in practice -- I would expect that it would be a bit more
random), some partitions would experience higher latency than others (not
sure how much it would matter in practice -- in the end of the day some
bytes produced to a topic would have higher latency and some bytes would
have lower latency).  This pattern is easily fixed by choosing the next
partition randomly instead of using round-robin.
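
For what it's worth, here is a tiny toy model of the scenario above (my own sketch, not producer code); both variants print the same eight 256KB requests at t = 0, 125, ..., 875ms, which is the point about the latency distribution staying the same:

    import java.util.Arrays;

    public class DrainPatternModel {
        // 16 partitions, 128KB produced to each at t=0, one in-flight request of
        // at most 256KB, 125ms round trip; only the per-partition batch cap differs.
        public static void main(String[] args) {
            simulate(16 * 1024);   // 16KB cap: a little from every partition per request
            simulate(256 * 1024);  // 256KB cap: drain two partitions per request
        }

        static void simulate(int batchBytes) {
            int partitions = 16, perPartition = 128 * 1024, maxRequest = 256 * 1024;
            int[] remaining = new int[partitions];
            Arrays.fill(remaining, perPartition);
            int timeMs = 0;
            System.out.println("batch cap = " + batchBytes / 1024 + "KB");
            while (Arrays.stream(remaining).sum() > 0) {
                int room = maxRequest;
                for (int p = 0; p < partitions && room > 0; p++) {  // fill one request
                    int take = Math.min(Math.min(batchBytes, remaining[p]), room);
                    remaining[p] -= take;
                    room -= take;
                }
                System.out.println("  256KB request sent at t=" + timeMs + "ms");
                timeMs += 125;  // wait for the single in-flight request to complete
            }
        }
    }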

-Artem

On Tue, Nov 23, 2021 at 12:08 AM Luke Chen  wrote:

> Hi Tom,
> Thanks for your comments. And thanks for Artem's explanation.
> Below is my response:
>
> > Currently because buffers are allocated using batch.size it means we can
> handle records that are that large (e.g. one big record per batch). Doesn't
> the introduction of smaller buffer sizes (batch.initial.size) mean a
> corresponding decrease in the maximum record size that the producer can
> handle?
>
> Actually, the "batch.size" is only like a threshold to decide if the batch
> is "ready to be sent". That is, even if you set the "batch.size=16KB"
> (default value), users can still send one record sized with 20KB, as long
> as the size is less than "max.request.size" in producer (default 1MB).
> Therefore, the introduction of "batch.initial.size" won't decrease the
> maximum record size that the producer can handle.
>
> > But isn't there the risk that drainBatchesForOneNode would end up not
> sending ready
> batches well past when they ought to be sent (according to their linger.ms
> ),
> because it's sending buffers for earlier partitions too aggressively?
>
> Did you mean that we have a "max.request.size" per request (default is
> 1MB), and before this KIP, the request can include 64 batches in single
> request ["batch.size"(16KB) * 64 = 1MB], but now, we might be able to
> include 32 batches or less, because we aggressively sent more records in
> one batch, is that what you meant? That's a really good point that I've
> never thought about. However, I think your suggestion to first go through the
> other partitions that just fit "batch.size", or whose "linger.ms" has expired,
> before handling the one that exceeds the "batch.size" limit is not a good
> approach, because it might leave the one with size > "batch.size" always at
> the lowest priority and cause a starvation issue where that batch never gets
> a chance to be sent.
>
> I don't have better solution for it, but maybe I can firstly decrease the
> "batch.max.size" to 32KB, instead of aggressively 256KB in the KIP. That
> should alleviate the problem. And still improve the throughput. What do you
> think?
>
> Thank you.
> Luke
>
> On Tue, Nov 23, 2021 at 9:04 AM Artem Livshits
>  wrote:
>
> > > I think this KIP would change the behaviour of producers when there are
> > multiple partitions ready to be sent
> >
> > This is correct, the pattern changes and becomes more coarse-grained.
> But
> > I don't think it changes fairness over the long run.  I think it's a good
> > idea to change drainIndex to be random rather than round robin to avoid
> > forming patterns where some partitions would consistently g

Re: [DISCUSS] KIP-782: Expandable batch size in producer

2021-11-22 Thread Artem Livshits
> I think this KIP would change the behaviour of producers when there are
multiple partitions ready to be sent

This is correct, the pattern changes and becomes more coarse-grained.  But
I don't think it changes fairness over the long run.  I think it's a good
idea to change drainIndex to be random rather than round robin to avoid
forming patterns where some partitions would consistently get higher
latencies than others because they wait longer for their turn.
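
(Something as small as randomizing the drain starting point would achieve that; an illustration only, readyPartitions and drain() are hypothetical stand-ins, not the actual RecordAccumulator code:)

    // Start draining from a random partition instead of a rotating index.
    int start = java.util.concurrent.ThreadLocalRandom.current().nextInt(readyPartitions.size());
    for (int i = 0; i < readyPartitions.size(); i++) {
        drain(readyPartitions.get((start + i) % readyPartitions.size()));
    }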

If we really wanted to preserve the exact patterns, we could either try to
support multiple 16KB batches from one partition per request (probably
would require protocol change to change logic on the broker for duplicate
detection) or try to re-batch 16KB batches from accumulator into larger
batches during send (additional computations) or try to consider all
partitions assigned to a broker to check if a new batch needs to be created
(i.e. compare cumulative batch size from all partitions assigned to a
broker and create new batch when cumulative size is 1MB, more complex).

Overall, it seems like just increasing the max batch size is a simpler
solution and it does favor larger batch sizes, which is beneficial not just
for production.

> ready batches well past when they ought to be sent (according to their
linger.ms)

The trigger for marking batches ready to be sent isn't changed - a batch is
ready to be sent once it reaches 16KB, so by the time larger batches start
forming, linger.ms wouldn't matter much because the batching goal is met
and the batch can be sent immediately.  Larger batches start forming once
the client starts waiting for the server, in which case some data will wait
its turn to be sent.  This will happen for some data regardless of how we
pick data to send, the question is just whether we'd have some scenarios
where some partitions would consistently experience higher latency than
others.  I think picking drainIndex randomly would prevent such scenarios.

-Artem

On Mon, Nov 22, 2021 at 2:28 AM Tom Bentley  wrote:

> Hi Luke,
>
> Thanks for the KIP!
>
> Currently because buffers are allocated using batch.size it means we can
> handle records that are that large (e.g. one big record per batch). Doesn't
> the introduction of smaller buffer sizes (batch.initial.size) mean a
> corresponding decrease in the maximum record size that the producer can
> handle? That might not be a problem if the user knows their maximum record
> size and has tuned batch.initial.size accordingly, but if the default for
> batch.initial.size < batch.size it could cause regressions for existing
> users with a large record size, I think. It should be enough for
> batch.initial.size to default to batch.size, allowing users who care about
> the memory saving in the off-peak throughput case to do the tuning, but not
> causing a regression for existing users.
>
> I think this KIP would change the behaviour of producers when there are
> multiple partitions ready to be sent: By sending all the ready buffers
> (which may now be > batch.size) for the first partition, we could end up
> excluding ready buffers for other partitions from the current send. In
> other words, as I understand the KIP currently, there's a change in
> fairness. I think the code in RecordAccumulator#drainBatchesForOneNode will
> ensure fairness in the long run, because the drainIndex will ensure that
> those other partitions each get their turn at being the first. But isn't
> there the risk that drainBatchesForOneNode would end up not sending ready
> batches well past when they ought to be sent (according to their linger.ms
> ),
> because it's sending buffers for earlier partitions too aggressively? Or,
> to put it another way, perhaps the RecordAccumulator should round-robin the
> ready buffers for _all_ the partitions before trying to fill the remaining
> space with the extra buffers (beyond the batch.size limit) for the first
> partitions?
>
> Kind regards,
>
> Tom
>
> On Wed, Oct 20, 2021 at 1:35 PM Luke Chen  wrote:
>
> > Hi Ismael and all devs,
> > Is there any comments/suggestions to this KIP?
> > If no, I'm going to update the KIP based on my previous mail, and start a
> > vote tomorrow or next week.
> >
> > Thank you.
> > Luke
> >
> > On Mon, Oct 18, 2021 at 2:40 PM Luke Chen  wrote:
> >
> > > Hi Ismael,
> > > Thanks for your comments.
> > >
> > > 1. Why do we have to reallocate the buffer? We can keep a list of
> buffers
> > > instead and avoid reallocation.
> > > -> Do you mean we allocate multiple buffers with "buffer.initial.size",
> > > and link them together (with linked list)?
> > > ex:
> > > a. We allocate 4KB initial buffer
> > > | 4KB |
> > >
> > > b. when new records reached and the remaining buffer is not enough for
> > the
> > > records, we create another batch with "batch.initial.size" buffer
> > > ex: we already have 3KB of data in the 1st buffer, and here comes the
> 2KB
> > > record
> > >
> > > | 4KB (1KB remaining) |
> > > now, record: 2KB coming
> > > We fill the 1st 

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

2021-11-19 Thread Artem Livshits
Hello,

During implementation it turned out that the existing Partitioner.partition
method doesn't have enough arguments to accurately estimate record size in
bytes (e.g. it doesn't have headers, cannot take compression into account,
etc.).  So I'm proposing to add a new Partitioner.partition method that
takes a callback that can be used to estimate record size.
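
Roughly, the shape could be something like the following (illustrative only -- see the KIP for the actual interface change; PartitionerCallbacks and its method are hypothetical names here):

    import java.io.Closeable;
    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.Configurable;

    public interface Partitioner extends Configurable, Closeable {
        // existing method
        int partition(String topic, Object key, byte[] keyBytes, Object value,
                      byte[] valueBytes, Cluster cluster);

        // hypothetical overload: the callback can estimate the serialized record
        // size, taking headers, compression, etc. into account
        default int partition(String topic, Object key, byte[] keyBytes, Object value,
                              byte[] valueBytes, Cluster cluster,
                              PartitionerCallbacks callbacks) {
            return partition(topic, key, keyBytes, value, valueBytes, cluster);
        }
    }

    interface PartitionerCallbacks {
        int estimateRecordSizeInBytes();
    }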

I've updated the KIP correspondingly
https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner

-Artem

On Mon, Nov 8, 2021 at 5:42 PM Artem Livshits 
wrote:

> Hi Luke, Justine,
>
> Thank you for feedback and questions. I've added clarification to the KIP.
>
> > there will be some period of time the distribution is not even.
>
> That's correct.  There would be a small temporary imbalance, but over time
> the distribution should be uniform.
>
> > 1. This paragraph is a little confusing, because there's no "batch mode"
> or "non-batch mode", right?
>
> Updated the wording to not use "batch mode"
>
> > 2. In motivation, you mentioned 1 drawback of current
> UniformStickyPartitioner is
>
> The problem with the current implementation is that it switches once a new
> batch is created which may happen after the first record when linger.ms=0.
> The new implementation won't switch after the batch, so even if the first
> record got sent out in a batch, the second record would be produced to the
> same partition.  Once we have 5 batches in-flight, the new records will
> pile up in the accumulator.
>
> > I was curious about how the logic automatically switches here.
>
> Added some clarifications to the KIP.  Basically, because we can only have
> 5 in-flight batches, as soon as the first 5 are in-flight, the records
> start piling up in the accumulator.  If the rate is low, records get sent
> quickly (e.g. if we have latency 50ms, and produce less than 20 rec / sec,
> then each record will often get sent in its own batch, because a batch
> would often complete before a new record arrives).  If the rate is high,
> then the first few records get sent quickly, but then records will batch
> together until one of the in-flight batches completes, the higher the rate
> is (or the higher latency is), the larger the batches are.
>
> This is not a new logic, btw, this is how it works now, the new logic just
> helps to utilize this better by giving the partition an opportunity to hit
> 5 in-flight and start accumulating sooner.  KIP-782 will make this even
> better, so batches could also grow beyond 16KB if production rate is high.
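
To put rough numbers on that (50ms round trip, 5 in-flight requests, figures approximate):

    at 10 records/sec (1 record per 100ms): an in-flight batch completes before
        the next record arrives, so most batches contain a single record
    at 1000 records/sec (1 record per 1ms): the first ~5 records go out right
        away, then ~50 records accumulate while waiting for an in-flight batch
        to complete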
>
> -Artem
>
>
> On Mon, Nov 8, 2021 at 11:56 AM Justine Olshan
>  wrote:
>
>> Hi Artem,
>> Thanks for working on improving the Sticky Partitioner!
>>
>> I had a few questions about this portion:
>>
>> *The batching will continue until either an in-flight batch completes or
>> we
>> hit the N bytes and move to the next partition.  This way it takes just 5
>> records to get to batching mode, not 5 x number of partition records, and
>> the batching mode will stay longer as we'll be batching while waiting for
>> a
>> request to be completed.  As the production rate accelerates, the logic
>> will automatically switch to use larger batches to sustain higher
>> throughput.*
>>
>> *If one of the brokers has higher latency the records for the partitions
>> hosted on that broker are going to form larger batches, but it's still
>> going to be the same *amount* of records sent less frequently in larger
>> batches, the logic automatically adapts to that.*
>>
>> I was curious about how the logic automatically switches here. It seems
>> like we are just adding *partitioner.sticky.batch.size *which seems like a
>> static value. Can you go into more detail about this logic? Or clarify
>> something I may have missed.
>>
>> On Mon, Nov 8, 2021 at 1:34 AM Luke Chen  wrote:
>>
>> > Thanks Artem,
>> > It's much better now.
>> > I've got your idea. In KIP-480: Sticky Partitioner, we'll change
>> partition
>> > (call partitioner) when either 1 of below condition match
>> > 1. the batch is full
>> > 2. when linger.ms is up
>> > But, you are changing the definition, into a
>> > "partitioner.sticky.batch.size" size is reached.
>> >
>> > It'll fix the uneven distribution issue, because we did the sent out
>> size
>> > calculation in the producer side.
>> > But it might have another issue that when the producer rate is low,
>> there
>> > will be some period of time the distribution is not even. Ex:
>> > tp-1: 12KB
>> > tp-2: 0KB
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

2021-11-08 Thread Artem Livshits
> batching (accumulating) records into batches. So I think the "batch mode"
> > description is confusing. And that's why I asked you if you have some
> kind
> > of "batch switch" here.
> >
> > 2. In motivation, you mentioned 1 drawback of current
> > UniformStickyPartitioner is "the sticky partitioner doesn't create
> batches
> > as efficiently", because it sent out a batch with only 1 record (under
> > linger.ms=0). But I can't tell how you fix this un-efficient issue in
> the
> > proposed solution. I still see we sent 1 record within 1 batch. Could you
> > explain more here?
> >
> > Thank you.
> > Luke
> >
> > On Sat, Nov 6, 2021 at 6:41 AM Artem Livshits
> >  wrote:
> >
> > > Hi Luke,
> > >
> > > Thank you for your feedback.  I've updated the KIP with your
> suggestions.
> > >
> > > 1. Updated with a better example.
> > > 2. I removed the reference to ClassicDefaultPartitioner, it was
> probably
> > > confusing.
> > > 3. The logic doesn't rely on checking batches, I've updated the
> proposal
> > to
> > > make it more explicit.
> > > 4. The primary issue (uneven distribution) is described in the linked
> > jira,
> > > copied an example from jira into the KIP as well.
> > >
> > > -Artem
> > >
> > >
> > > On Thu, Nov 4, 2021 at 8:34 PM Luke Chen  wrote:
> > >
> > > > Hi Artem,
> > > > Thanks for the KIP! And thanks for reminding me to complete KIP-782,
> > > soon.
> > > > :)
> > > >
> > > > Back to the KIP, I have some comments:
> > > > 1. You proposed to have a new config:
> "partitioner.sticky.batch.size",
> > > but
> > > > I can't see how we're going to use it to make the partitioner better.
> > > > Please explain more in KIP (with an example will be better as
> > suggestion
> > > > (4))
> > > > 2. In the "Proposed change" section, you take an example to use
> > > > "ClassicDefaultPartitioner", is that referring to the current default
> > > > sticky partitioner? I think it'd better you name your proposed
> > partition
> > > > with a different name for distinguish between the default one and new
> > > one.
> > > > (Although after implementation, we are going to just use the same
> name)
> > > > 3. So, if my understanding is correct, you're going to have a "batch"
> > > > switch, and before the in-flight is full, it's disabled. Otherwise,
> > we'll
> > > > enable it. Is that right? Sorry, I don't see any advantage of having
> > this
> > > > batch switch. Could you explain more?
> > > > 4. I think it should be more clear if you can have a clear real
> example
> > > in
> > > > the motivation section, to describe what issue we faced using current
> > > > sticky partitioner. And in proposed changes section, using the same
> > > > example, to describe more detail about how you fix this issue with
> your
> > > > way.
> > > >
> > > > Thank you.
> > > > Luke
> > > >
> > > > On Fri, Nov 5, 2021 at 1:38 AM Artem Livshits
> > > >  wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > This is the discussion thread for
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > .
> > > > >
> > > > > The proposal is a bug fix for
> > > > > https://issues.apache.org/jira/browse/KAFKA-10888, but it does
> > > include a
> > > > > client config change, therefore we have a KIP to discuss.
> > > > >
> > > > > -Artem
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

2021-11-05 Thread Artem Livshits
Hi Luke,

Thank you for your feedback.  I've updated the KIP with your suggestions.

1. Updated with a better example.
2. I removed the reference to ClassicDefaultPartitioner, it was probably
confusing.
3. The logic doesn't rely on checking batches, I've updated the proposal to
make it more explicit.
4. The primary issue (uneven distribution) is described in the linked jira,
copied an example from jira into the KIP as well.

-Artem


On Thu, Nov 4, 2021 at 8:34 PM Luke Chen  wrote:

> Hi Artem,
> Thanks for the KIP! And thanks for reminding me to complete KIP-782, soon.
> :)
>
> Back to the KIP, I have some comments:
> 1. You proposed to have a new config: "partitioner.sticky.batch.size", but
> I can't see how we're going to use it to make the partitioner better.
> Please explain more in the KIP (an example would be better, as in suggestion
> (4))
> 2. In the "Proposed change" section, you take an example to use
> "ClassicDefaultPartitioner", is that referring to the current default
> sticky partitioner? I think it'd better you name your proposed partition
> with a different name for distinguish between the default one and new one.
> (Although after implementation, we are going to just use the same name)
> 3. So, if my understanding is correct, you're going to have a "batch"
> switch, and before the in-flight is full, it's disabled. Otherwise, we'll
> enable it. Is that right? Sorry, I don't see any advantage of having this
> batch switch. Could you explain more?
> 4. I think it should be more clear if you can have a clear real example in
> the motivation section, to describe what issue we faced using current
> sticky partitioner. And in proposed changes section, using the same
> example, to describe more detail about how you fix this issue with your
> way.
>
> Thank you.
> Luke
>
> On Fri, Nov 5, 2021 at 1:38 AM Artem Livshits
>  wrote:
>
> > Hello,
> >
> > This is the discussion thread for
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > .
> >
> > The proposal is a bug fix for
> > https://issues.apache.org/jira/browse/KAFKA-10888, but it does include a
> > client config change, therefore we have a KIP to discuss.
> >
> > -Artem
> >
>


[DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

2021-11-04 Thread Artem Livshits
Hello,

This is the discussion thread for
https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
.

The proposal is a bug fix for
https://issues.apache.org/jira/browse/KAFKA-10888, but it does include a
client config change, therefore we have a KIP to discuss.

-Artem


Wiki permission request

2021-11-02 Thread Artem Livshits
Hello,

I'd like to be added to the contributors list, so that I can submit a KIP.

My Jira ID is: alivshits
Wiki ID: alivshits

Thanks,
-Artem


  1   2   >