Re: Ci stability

2022-11-24 Thread John Roesler
Hi Dan,

I’m not sure if there’s a consistently used tag, but I’ve gotten good mileage 
out of just searching for “flaky” or “flaky test” in Jira. 

If you’re thinking about filing a ticket for a specific test failure you’ve 
seen, I’ve also usually been able to find out whether there’s already a ticket 
by searching for the test class or method name. 

People seem to typically file tickets with “flaky” in the title and then the 
test name. 
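
If it helps, a Jira search along these lines (illustrative JQL, not an official saved 
filter) usually turns up existing reports:

    project = KAFKA AND summary ~ "flaky" AND summary ~ "<test class or method name>"

where the second summary clause is just a placeholder for the failing test's name.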

Thanks again for your interest in improving the situation!
-John

On Thu, Nov 24, 2022, at 10:08, Dan S wrote:
> Thanks for the reply John! Is there a jira tag or view or something that
> can be used to find all the failing tests and maybe even try to fix them
> (even if fix just means extending a timeout)?
>
>
>
> On Thu, Nov 24, 2022, 16:03 John Roesler  wrote:
>
>> Hi Dan,
>>
>> Thanks for pointing this out. Flaky tests are a perennial problem. We
>> knock them out every now and then, but eventually more spring up.
>>
>> I’ve had some luck in the past filing Jira tickets for the failing tests
>> as they pop up in my PRs. Another thing that seems to motivate people is to
>> open a PR to disable the test in question, as you mention. That can be a
>> bit aggressive, though, so it wouldn’t be my first suggestion.
>>
>> I appreciate you bringing this up. I agree that flaky tests pose a risk to
>> the project because they make it harder to know whether a PR breaks things
>> or not.
>>
>> Thanks,
>> John
>>
>> On Thu, Nov 24, 2022, at 02:38, Dan S wrote:
>> > Hello all,
>> >
>> > I've had a pr that has been open for a little over a month (several
>> > feedback cycles happened), and I've never seen a fully passing build
>> (tests
>> > in completely different parts of the codebase seemed to fail, often
>> > timeouts). A cursory look at open PRs seems to indicate that mine is not
>> > the only one. I was wondering if there is a place where all the flaky
>> tests
>> > are being tracked, and if it makes sense to fix (or at least temporarily
>> > disable) them so that confidence in new PRs could be increased.
>> >
>> > Thanks,
>> >
>> > Dan
>>


Build failed in Jenkins: Kafka » Kafka Branch Builder » trunk #1375

2022-11-24 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 513405 lines...]
[2022-11-24T17:30:34.799Z] 
[2022-11-24T17:30:34.799Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > TaskAssignorIntegrationTest > 
shouldProperlyConfigureTheAssignor PASSED
[2022-11-24T17:30:36.654Z] 
[2022-11-24T17:30:36.654Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > TaskMetadataIntegrationTest > 
shouldReportCorrectEndOffsetInformation STARTED
[2022-11-24T17:30:38.631Z] 
[2022-11-24T17:30:38.631Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > TaskMetadataIntegrationTest > 
shouldReportCorrectEndOffsetInformation PASSED
[2022-11-24T17:30:38.631Z] 
[2022-11-24T17:30:38.631Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > TaskMetadataIntegrationTest > 
shouldReportCorrectCommittedOffsetInformation STARTED
[2022-11-24T17:30:41.737Z] 
[2022-11-24T17:30:41.737Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > TaskMetadataIntegrationTest > 
shouldReportCorrectCommittedOffsetInformation PASSED
[2022-11-24T17:30:43.806Z] 
[2022-11-24T17:30:43.806Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > HandlingSourceTopicDeletionIntegrationTest > 
shouldThrowErrorAfterSourceTopicDeleted STARTED
[2022-11-24T17:30:51.351Z] 
[2022-11-24T17:30:51.351Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > HandlingSourceTopicDeletionIntegrationTest > 
shouldThrowErrorAfterSourceTopicDeleted PASSED
[2022-11-24T17:30:52.464Z] 
[2022-11-24T17:30:52.464Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > AdjustStreamThreadCountTest > 
testConcurrentlyAccessThreads() STARTED
[2022-11-24T17:30:55.962Z] 
[2022-11-24T17:30:55.962Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > AdjustStreamThreadCountTest > 
testConcurrentlyAccessThreads() PASSED
[2022-11-24T17:30:55.962Z] 
[2022-11-24T17:30:55.962Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > AdjustStreamThreadCountTest > 
shouldResizeCacheAfterThreadReplacement() STARTED
[2022-11-24T17:31:00.099Z] 
[2022-11-24T17:31:00.099Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > AdjustStreamThreadCountTest > 
shouldResizeCacheAfterThreadReplacement() PASSED
[2022-11-24T17:31:00.099Z] 
[2022-11-24T17:31:00.099Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > AdjustStreamThreadCountTest > 
shouldAddAndRemoveThreadsMultipleTimes() STARTED
[2022-11-24T17:31:06.062Z] 
[2022-11-24T17:31:06.062Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > AdjustStreamThreadCountTest > 
shouldAddAndRemoveThreadsMultipleTimes() PASSED
[2022-11-24T17:31:06.062Z] 
[2022-11-24T17:31:06.062Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > AdjustStreamThreadCountTest > 
shouldnNotRemoveStreamThreadWithinTimeout() STARTED
[2022-11-24T17:31:07.270Z] 
[2022-11-24T17:31:07.270Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > AdjustStreamThreadCountTest > 
shouldnNotRemoveStreamThreadWithinTimeout() PASSED
[2022-11-24T17:31:07.270Z] 
[2022-11-24T17:31:07.270Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > AdjustStreamThreadCountTest > 
shouldAddAndRemoveStreamThreadsWhileKeepingNamesCorrect() STARTED
[2022-11-24T17:31:28.855Z] 
[2022-11-24T17:31:28.855Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > AdjustStreamThreadCountTest > 
shouldAddAndRemoveStreamThreadsWhileKeepingNamesCorrect() PASSED
[2022-11-24T17:31:28.855Z] 
[2022-11-24T17:31:28.855Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > AdjustStreamThreadCountTest > shouldAddStreamThread() 
STARTED
[2022-11-24T17:31:31.953Z] 
[2022-11-24T17:31:31.953Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > AdjustStreamThreadCountTest > shouldAddStreamThread() PASSED
[2022-11-24T17:31:31.953Z] 
[2022-11-24T17:31:31.953Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > AdjustStreamThreadCountTest > 
shouldRemoveStreamThreadWithStaticMembership() STARTED
[2022-11-24T17:31:36.179Z] 
[2022-11-24T17:31:36.179Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > AdjustStreamThreadCountTest > 
shouldRemoveStreamThreadWithStaticMembership() PASSED
[2022-11-24T17:31:36.179Z] 
[2022-11-24T17:31:36.179Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > AdjustStreamThreadCountTest > shouldRemoveStreamThread() 
STARTED
[2022-11-24T17:31:40.405Z] 
[2022-11-24T17:31:40.405Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 172 > AdjustStreamThreadCountTest > shouldRemoveStreamThread() 
PASSED
[2022-11-24T17:31:40.405Z] 
[2022-11-24T17:31:40.405Z] Gradle Test Run :streams:integrationTest 

Build failed in Jenkins: Kafka » Kafka Branch Builder » 3.3 #125

2022-11-24 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 336046 lines...]
[2022-11-24T17:05:41.232Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.3/streams/src/main/java/org/apache/kafka/streams/kstream/KStream.java:919:
 warning - Tag @link: reference not found: DefaultPartitioner
[2022-11-24T17:05:41.232Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.3/streams/src/main/java/org/apache/kafka/streams/kstream/KStream.java:939:
 warning - Tag @link: reference not found: DefaultPartitioner
[2022-11-24T17:05:41.232Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.3/streams/src/main/java/org/apache/kafka/streams/kstream/KStream.java:854:
 warning - Tag @link: reference not found: DefaultPartitioner
[2022-11-24T17:05:41.232Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.3/streams/src/main/java/org/apache/kafka/streams/kstream/KStream.java:890:
 warning - Tag @link: reference not found: DefaultPartitioner
[2022-11-24T17:05:41.232Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.3/streams/src/main/java/org/apache/kafka/streams/kstream/KStream.java:919:
 warning - Tag @link: reference not found: DefaultPartitioner
[2022-11-24T17:05:41.232Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.3/streams/src/main/java/org/apache/kafka/streams/kstream/KStream.java:939:
 warning - Tag @link: reference not found: DefaultPartitioner
[2022-11-24T17:05:41.232Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.3/streams/src/main/java/org/apache/kafka/streams/kstream/Produced.java:84:
 warning - Tag @link: reference not found: DefaultPartitioner
[2022-11-24T17:05:41.232Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.3/streams/src/main/java/org/apache/kafka/streams/kstream/Produced.java:136:
 warning - Tag @link: reference not found: DefaultPartitioner
[2022-11-24T17:05:41.232Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.3/streams/src/main/java/org/apache/kafka/streams/kstream/Produced.java:147:
 warning - Tag @link: reference not found: DefaultPartitioner
[2022-11-24T17:05:41.232Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.3/streams/src/main/java/org/apache/kafka/streams/kstream/Repartitioned.java:101:
 warning - Tag @link: reference not found: DefaultPartitioner
[2022-11-24T17:05:41.232Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.3/streams/src/main/java/org/apache/kafka/streams/kstream/Repartitioned.java:167:
 warning - Tag @link: reference not found: DefaultPartitioner
[2022-11-24T17:05:41.232Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.3/streams/src/main/java/org/apache/kafka/streams/TopologyConfig.java:58:
 warning - Tag @link: missing '#': "org.apache.kafka.streams.StreamsBuilder()"
[2022-11-24T17:05:41.232Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.3/streams/src/main/java/org/apache/kafka/streams/TopologyConfig.java:58:
 warning - Tag @link: can't find org.apache.kafka.streams.StreamsBuilder() in 
org.apache.kafka.streams.TopologyConfig
[2022-11-24T17:05:41.232Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.3/streams/src/main/java/org/apache/kafka/streams/TopologyDescription.java:38:
 warning - Tag @link: reference not found: ProcessorContext#forward(Object, 
Object) forwards
[2022-11-24T17:05:41.232Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.3/streams/src/main/java/org/apache/kafka/streams/query/Position.java:44:
 warning - Tag @link: can't find query(Query,
[2022-11-24T17:05:41.232Z]  PositionBound, boolean) in 
org.apache.kafka.streams.processor.StateStore
[2022-11-24T17:05:42.167Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.3/streams/src/main/java/org/apache/kafka/streams/query/QueryResult.java:44:
 warning - Tag @link: can't find query(Query, PositionBound, boolean) in 
org.apache.kafka.streams.processor.StateStore
[2022-11-24T17:05:42.167Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.3/streams/src/main/java/org/apache/kafka/streams/query/QueryResult.java:36:
 warning - Tag @link: can't find query(Query, PositionBound, boolean) in 
org.apache.kafka.streams.processor.StateStore
[2022-11-24T17:05:42.167Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.3/streams/src/main/java/org/apache/kafka/streams/query/QueryResult.java:57:
 warning - Tag @link: can't find query(Query, PositionBound, boolean) in 
org.apache.kafka.streams.processor.StateStore
[2022-11-24T17:05:42.167Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.3/streams/src/main/java/org/apache/kafka/streams/query/QueryResult.java:74:
 warning - Tag @link: can't find query(Query, PositionBound, boolean) in 
org.apache.kafka.streams.processor.StateStore
[2022-11-24T17:05:42.167Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.3/streams/src/main/java/org/apache/kafka/streams/query/QueryResult.java:110:
 warning - Tag @link: reference not found: this#getResult()
[2022-11-24T17:05:42.167Z] 
/home/jenkins/jenkins

Re: Ci stability

2022-11-24 Thread Dan S
Thanks for the reply John! Is there a jira tag or view or something that
can be used to find all the failing tests and maybe even try to fix them
(even if fix just means extending a timeout)?



On Thu, Nov 24, 2022, 16:03 John Roesler  wrote:

> Hi Dan,
>
> Thanks for pointing this out. Flaky tests are a perennial problem. We
> knock them out every now and then, but eventually more spring up.
>
> I’ve had some luck in the past filing Jira tickets for the failing tests
> as they pop up in my PRs. Another thing that seems to motivate people is to
> open a PR to disable the test in question, as you mention. That can be a
> bit aggressive, though, so it wouldn’t be my first suggestion.
>
> I appreciate you bringing this up. I agree that flaky tests pose a risk to
> the project because they make it harder to know whether a PR breaks things
> or not.
>
> Thanks,
> John
>
> On Thu, Nov 24, 2022, at 02:38, Dan S wrote:
> > Hello all,
> >
> > I've had a pr that has been open for a little over a month (several
> > feedback cycles happened), and I've never seen a fully passing build
> (tests
> > in completely different parts of the codebase seemed to fail, often
> > timeouts). A cursory look at open PRs seems to indicate that mine is not
> > the only one. I was wondering if there is a place where all the flaky
> tests
> > are being tracked, and if it makes sense to fix (or at least temporarily
> > disable) them so that confidence in new PRs could be increased.
> >
> > Thanks,
> >
> > Dan
>


Re: Ci stability

2022-11-24 Thread John Roesler
Hi Dan,

Thanks for pointing this out. Flaky tests are a perennial problem. We knock 
them out every now and then, but eventually more spring up.

I’ve had some luck in the past filing Jira tickets for the failing tests as 
they pop up in my PRs. Another thing that seems to motivate people is to open a 
PR to disable the test in question, as you mention. That can be a bit 
aggressive, though, so it wouldn’t be my first suggestion.
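
(For anyone unfamiliar with what "disable the test" means in practice: it is usually 
just an annotation pointing at the tracking ticket, roughly like the sketch below, 
assuming a JUnit 5 test and a placeholder ticket number:

    // imports: org.junit.jupiter.api.Disabled, org.junit.jupiter.api.Test
    @Disabled("KAFKA-XXXXX: flaky, see the ticket before re-enabling")
    @Test
    public void shouldDoTheFlakyThing() {
        // test body unchanged; only the annotation is added
    }

JUnit 4 tests would use @Ignore instead.)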

I appreciate you bringing this up. I agree that flaky tests pose a risk to the 
project because they make it harder to know whether a PR breaks things or not. 

Thanks,
John

On Thu, Nov 24, 2022, at 02:38, Dan S wrote:
> Hello all,
>
> I've had a pr that has been open for a little over a month (several
> feedback cycles happened), and I've never seen a fully passing build (tests
> in completely different parts of the codebase seemed to fail, often
> timeouts). A cursory look at open PRs seems to indicate that mine is not
> the only one. I was wondering if there is a place where all the flaky tests
> are being tracked, and if it makes sense to fix (or at least temporarily
> disable) them so that confidence in new PRs could be increased.
>
> Thanks,
>
> Dan


subscribe kafka issues

2022-11-24 Thread ChunYa Sun
subscribe Kafka issues


Re: [DISCUSS] KIP-866 ZooKeeper to KRaft Migration

2022-11-24 Thread Igor Soarez
Hi David,

Zookeeper mode writes meta.properties with version=0. KRaft mode requires 
version=1 in meta.properties.

Will a manual step be required to update meta.properties or will brokers 
somehow update meta.properties files to version 1?
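
For context, the difference I have in mind looks roughly like this (illustrative 
contents only; the exact set of keys may vary between versions):

    # meta.properties written by a ZooKeeper-mode broker
    version=0
    broker.id=1
    cluster.id=AbCdEfGhIjKlMnOpQrStUv

    # meta.properties expected by a KRaft-mode node
    version=1
    node.id=1
    cluster.id=AbCdEfGhIjKlMnOpQrStUv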

Thanks,

--
Igor



[jira] [Resolved] (KAFKA-14009) Rebalance timeout should be updated when static member rejoins

2022-11-24 Thread David Jacot (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Jacot resolved KAFKA-14009.
-
Fix Version/s: 3.4.0
   3.3.2
   Resolution: Fixed

> Rebalance timeout should be updated when static member rejoins
> --
>
> Key: KAFKA-14009
> URL: https://issues.apache.org/jira/browse/KAFKA-14009
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer, core
>Affects Versions: 2.3.1, 2.6.1
>Reporter: zou shengfu
>Assignee: zou shengfu
>Priority: Minor
> Fix For: 3.4.0, 3.3.2
>
>
> When a consumer uses the static membership rebalance protocol and wants to
> reduce the rebalance timeout, the new timeout does not take effect
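
For readers less familiar with the feature: static membership is enabled with 
group.instance.id, and the consumer's rebalance timeout is derived from 
max.poll.interval.ms, so the reported setup is roughly the following (illustrative 
property values, not taken from the issue):

    # consumer configuration (illustrative)
    # enables static membership
    group.instance.id=payment-consumer-1
    # lowered value, which is what the consumer uses as its rebalance timeout
    max.poll.interval.ms=60000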



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14372) RackAwareReplicaSelector should choose a replica from the isr

2022-11-24 Thread David Jacot (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Jacot resolved KAFKA-14372.
-
Fix Version/s: 3.4.0
   3.3.2
   Resolution: Fixed

> RackAwareReplicaSelector should choose a replica from the isr
> -
>
> Key: KAFKA-14372
> URL: https://issues.apache.org/jira/browse/KAFKA-14372
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jeff Kim
>Assignee: Jeff Kim
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
>
> The default replica selector chooses a replica based on whether the broker.rack 
> matches the client.rack in the fetch request and whether the offset exists on 
> the follower. If the follower is not in the ISR, we know it's lagging behind, 
> which will also lag the consumer behind. Let's consider two cases:
>  # The follower recovers and joins the ISR: the consumer will no longer lag.
>  # The follower continues to lag behind: after 5 minutes, the consumer will 
> refresh the preferred read replica and it returns the same lagging follower, 
> since the offset the consumer will fetch from is capped by the follower's 
> HWM. This can go on indefinitely.
> If the replica selector chooses a broker in the ISR then we can ensure that 
> at least every 5 minutes the consumer will consume from an up-to-date 
> replica. 
>  
>  
>  
>  
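
A minimal sketch of the selection rule described above (made-up types here, not 
Kafka's actual ReplicaSelector API): consider only replicas that are currently in 
the ISR, prefer a rack match, and otherwise fall back to any in-sync replica.

    import java.util.List;
    import java.util.Optional;

    // Hypothetical model of a replica; the real Kafka interfaces differ.
    record Replica(int brokerId, String rack, boolean inIsr) {}

    class IsrAwareRackSelector {
        // Prefer an in-sync, rack-matching replica; otherwise any in-sync replica.
        Optional<Replica> select(String clientRack, List<Replica> replicas) {
            return replicas.stream()
                    .filter(Replica::inIsr)
                    .filter(r -> clientRack.equals(r.rack()))
                    .findFirst()
                    .or(() -> replicas.stream().filter(Replica::inIsr).findFirst());
        }
    }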



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

2022-11-24 Thread Jorge Esteban Quilcate Otoya
Thanks Chris! I have updated the KIP with "transform" instead of "alias".
Agree it's clearer.

Cheers,
Jorge.

On Mon, 21 Nov 2022 at 21:36, Chris Egerton  wrote:

> Hi Jorge,
>
> Thanks for the updates, and apologies for the delay. The new diagram
> directly under the "Proposed Changes" section is absolutely gorgeous!
>
>
> Follow-ups:
>
> RE 2: Good point. We can use the same level for these metrics, it's not a
> big deal.
>
> RE 3: As long as all the per-record metrics are kept at DEBUG level, it
> should be fine to leave JMH benchmarking for a follow-up. If we want to add
> new per-record, INFO-level metrics, I would be more comfortable with
> including benchmarking as part of the testing plan for the KIP. One
> possible compromise could be to propose that these features be merged at
> DEBUG level, and then possibly upgraded to INFO level in the future pending
> benchmarks to guard against performance degradation.
>
> RE 4: I think for a true "end-to-end" metric, it'd be useful to include the
> time taken by the task to actually deliver the record. However, with the
> new metric names and descriptions provided in the KIP, I have no objections
> with what's currently proposed, and a new "end-to-end" metric can be taken
> on later in a follow-up KIP.
>
> RE 6: You're right, existing producer metrics should be enough for now. We
> can revisit this later if/when we add delivery-centric metrics for sink
> tasks as well.
>
> RE 7: The new metric names in the KIP LGTM; I don't see any need to expand
> beyond those but if you'd still like to pursue others, LMK.
>
>
> New thoughts:
>
> One small thought: instead of "alias" in "alias="{transform_alias}" for the
> per-transform metrics, could we use "transform"? IMO it's clearer since we
> don't use "alias" in the names of transform-related properties, and "alias"
> may be confused with the classloading term where you can use, e.g.,
> "FileStreamSource" as the name of a connector class in a connector config
> instead of "org.apache.kafka.connect.file.FileStreamSourceConnector".
>
>
> Cheers,
>
> Chris
>
> On Fri, Nov 18, 2022 at 12:06 PM Jorge Esteban Quilcate Otoya <
> quilcate.jo...@gmail.com> wrote:
>
> > Thanks Mickael!
> >
> >
> > On Wed, 9 Nov 2022 at 15:54, Mickael Maison 
> > wrote:
> >
> > > Hi Jorge,
> > >
> > > Thanks for the KIP, it is a nice improvement.
> > >
> > > 1) The per transformation metrics still have a question mark next to
> > > them in the KIP. Do you want to include them? If so we'll want to tag
> > > them, we should be able to include the aliases in TransformationChain
> > > and use them.
> > >
> >
> > Yes, I have added the changes on TransformChain that will be needed to
> add
> > these metrics.
> >
> >
> > >
> > > 2) I see no references to predicates. If we don't want to measure
> > > their latency, can we say it explicitly?
> > >
> >
> > Good question, I haven't considered these. Though as these are
> materialized
> > as PredicatedTransformation, they should be covered by these changes.
> > Adding a note about this.
> >
> >
> > >
> > > 3) Should we have sink-record-batch-latency-avg-ms? All other metrics
> > > have both the maximum and average values.
> > >
> > >
> > Good question. I will remove it and change the record latency from
> > DEBUG->INFO, as it already covers the maximum metric.
> >
> > Hope it's clearer now; let me know if there is any additional feedback.
> > Thanks!
> >
> >
> >
> > > Thanks,
> > > Mickael
> > >
> > > On Thu, Oct 20, 2022 at 9:58 PM Jorge Esteban Quilcate Otoya
> > >  wrote:
> > > >
> > > > Thanks, Chris! Great feedback! Please, find my comments below:
> > > >
> > > > On Thu, 13 Oct 2022 at 18:52, Chris Egerton  >
> > > wrote:
> > > >
> > > > > Hi Jorge,
> > > > >
> > > > > Thanks for the KIP. I agree with the overall direction and think
> this
> > > would
> > > > > be a nice improvement to Kafka Connect. Here are my initial
> thoughts
> > > on the
> > > > > details:
> > > > >
> > > > > 1. The motivation section outlines the gaps in Kafka Connect's task
> > > metrics
> > > > > nicely. I think it'd be useful to include more concrete details on
> > why
> > > > > these gaps need to be filled in, and in which cases additional
> > metrics
> > > > > would be helpful. One goal could be to provide enhanced monitoring
> of
> > > > > production deployments that allows for cluster administrators to
> set
> > up
> > > > > automatic alerts for latency spikes and, if triggered, quickly
> > > identify the
> > > > > root cause of those alerts, reducing the time to remediation.
> Another
> > > goal
> > > > > could be to provide more insight to developers or cluster
> > > administrators
> > > > > who want to do performance testing on connectors in non-production
> > > > > environments. It may help guide our decision making process to
> have a
> > > > > clearer picture of the goals we're trying to achieve.
> > > > >
> > > >
> > > > Agree. The Motivation section has been updated.
> > > > Thanks for the examples, I see both of th

[jira] [Created] (KAFKA-14419) Same message consumed again by the same stream task after partition is lost and reassigned

2022-11-24 Thread Mikael (Jira)
Mikael created KAFKA-14419:
--

 Summary: Same message consumed again by the same stream task after 
partition is lost and reassigned
 Key: KAFKA-14419
 URL: https://issues.apache.org/jira/browse/KAFKA-14419
 Project: Kafka
  Issue Type: Bug
  Components: streams
Affects Versions: 3.3.1
 Environment: AWS EC2 CentOS Linux 3.10.0-1160.76.1.el7.x86_64
Reporter: Mikael


Trigger scenario:

Four Kafka client application instances on separate EC2 instances with a total 
of 8 active and 8 standby stream tasks for the same stream topology, consuming 
from an input topic with 8 partitions. Sometimes a handful of messages are 
consumed twice by one of the stream tasks when stream tasks on another 
application instance join the consumer group after an application instance 
restart.

Additional information:

Messages are produced to the topic by another Kafka streams topology deployed 
on the same four application instances. I have verified that each message is 
only produced once by enabling debug logging in the topology flow right before 
producing each message to the topic.

Logs from stream thread with duplicate consumption:

 
{code:java}
2022-11-21 15:09:33,677 INFO 
[messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
 o.a.k.c.c.i.ConsumerCoordinator [AbstractCoordinator.java:1066] [Consumer 
clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
 groupId=messages.xms.mt.enqueue.sms] Request joining group due to: group is 
already rebalancing
2022-11-21 15:09:33,677 INFO 
[messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
 o.a.k.c.c.i.ConsumerCoordinator [AbstractCoordinator.java:566] [Consumer 
clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
 groupId=messages.xms.mt.enqueue.sms] (Re-)joining group

Input records consumed for the first time

2022-11-21 15:09:33,919 INFO 
[messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
 o.a.k.c.c.i.ConsumerCoordinator [AbstractCoordinator.java:627] [Consumer 
clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
 groupId=messages.xms.mt.enqueue.sms] Successfully joined group with generation 
Generation{generationId=8017, 
memberId='messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer-77a68a5a-fb15-4808-9d87-30f21eabea74',
 protocol='stream'}
2022-11-21 15:09:33,920 INFO 
[messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
 o.a.k.c.c.i.ConsumerCoordinator [AbstractCoordinator.java:826] [Consumer 
clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
 groupId=messages.xms.mt.enqueue.sms] SyncGroup failed: The group began another 
rebalance. Need to re-join the group. Sent generation was 
Generation{generationId=8017, 
memberId='messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer-77a68a5a-fb15-4808-9d87-30f21eabea74',
 protocol='stream'}
2022-11-21 15:09:33,922 INFO 
[messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
 o.a.k.c.c.i.ConsumerCoordinator [AbstractCoordinator.java:1019] [Consumer 
clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
 groupId=messages.xms.mt.enqueue.sms] Resetting generation due to: encountered 
REBALANCE_IN_PROGRESS from SYNC_GROUP response
2022-11-21 15:09:33,922 INFO 
[messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
 o.a.k.c.c.i.ConsumerCoordinator [AbstractCoordinator.java:1066] [Consumer 
clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
 groupId=messages.xms.mt.enqueue.sms] Request joining group due to: encountered 
REBALANCE_IN_PROGRESS from SYNC_GROUP response
2022-11-21 15:09:33,923 INFO 
[messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
 o.a.k.c.c.i.ConsumerCoordinator [ConsumerCoordinator.java:819] [Consumer 
clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
 groupId=messages.xms.mt.enqueue.sms] Giving away all assigned partitions as 
lost since generation/memberID has been reset,indicating that consumer is in 
old state or no longer part of the group
2022-11-21 15:09:33,923 INFO 
[messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
 o.a.k.c.c.i.ConsumerCoordinator [ConsumerCoordinator.java:354] [Consumer 
clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
 groupId=messages.xms.mt.enqueue.sms] Lost previously assigned partitions 
messages.xms.mt.batch.enqueue.sms-1
2022-11-21 15:09:33,923 INFO 
[messages.xms.mt.enqueue.sms-acde244d-99b7-

[jira] [Created] (KAFKA-14418) Add safety checks for modifying partitions of __consumer_offsets

2022-11-24 Thread Divij Vaidya (Jira)
Divij Vaidya created KAFKA-14418:


 Summary: Add safety checks for modifying partitions of 
__consumer_offsets
 Key: KAFKA-14418
 URL: https://issues.apache.org/jira/browse/KAFKA-14418
 Project: Kafka
  Issue Type: Improvement
Reporter: Divij Vaidya
 Fix For: 3.4.0


Today a user can change the number of partitions of the __consumer_offsets topic by 
changing the configuration value for offsets.topic.num.partitions or manually by 
using the CreatePartition API. 

Changing the partitions of this reserved topic leads to problems with consumer 
groups unless you restart all brokers. Thus, if this operation is not done right, 
there is a high probability that users will shoot themselves in the foot. 
Example scenario: 
https://stackoverflow.com/questions/73944561/kafka-consumer-group-coordinator-inconsistent
 

To remedy this, I propose the following changes:
1. `kafka-topics.sh` should explicitly block adding new partitions to internal 
topics
2. Add an operational guide to the docs for safely modifying partitions of 
__consumer_offsets at [https://kafka.apache.org/documentation.html#basic_ops] 
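
For illustration, this is the kind of invocation that change 1 would reject 
(hypothetical partition count; do not run this against a live cluster):

    # today this succeeds and can leave group coordination broken until all brokers restart
    bin/kafka-topics.sh --bootstrap-server localhost:9092 \
      --alter --topic __consumer_offsets --partitions 100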



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Ci stability

2022-11-24 Thread Dan S
Hello all,

I've had a pr that has been open for a little over a month (several
feedback cycles happened), and I've never seen a fully passing build (tests
in completely different parts of the codebase seemed to fail, often
timeouts). A cursory look at open PRs seems to indicate that mine is not
the only one. I was wondering if there is a place where all the flaky tests
are being tracked, and if it makes sense to fix (or at least temporarily
disable) them so that confidence in new PRs could be increased.

Thanks,

Dan