Build failed in Jenkins: Kafka » Kafka Branch Builder » 3.4 #155

2023-08-08 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 4405 lines...]
  - group 5 entry nodes: [:spotlessScalaCheck (complete)]
  - group 6 entry nodes: [:clients:checkstyleMain (complete), 
:connect:checkstyleMain (complete), :core:checkstyleMain (complete), 
:examples:checkstyleMain (complete), :generator:checkstyleMain (complete), 
:group-coordinator:checkstyleMain (complete), :jmh-benchmarks:checkstyleMain 
(complete), :log4j-appender:checkstyleMain (complete), :metadata:checkstyleMain 
(complete), :raft:checkstyleMain (complete), :server-common:checkstyleMain 
(complete), :shell:checkstyleMain (complete), :storage:checkstyleMain 
(complete), :streams:checkstyleMain (complete), :tools:checkstyleMain 
(complete), :trogdor:checkstyleMain (complete), :connect:api:checkstyleMain 
(complete), :connect:basic-auth-extension:checkstyleMain (complete), 
:connect:file:checkstyleMain (complete), :connect:json:checkstyleMain 
(complete), :connect:mirror:checkstyleMain (complete), 
:connect:mirror-client:checkstyleMain (complete), 
:connect:runtime:checkstyleMain (complete), :connect:transforms:checkstyleMain 
(complete), :storage:api:checkstyleMain (complete), 
:streams:examples:checkstyleMain (complete), 
:streams:streams-scala:checkstyleMain (complete), 
:streams:test-utils:checkstyleMain (complete), 
:streams:upgrade-system-tests-0100:checkstyleMain (complete), 
:streams:upgrade-system-tests-0101:checkstyleMain (complete), 
:streams:upgrade-system-tests-0102:checkstyleMain (complete), 
:streams:upgrade-system-tests-0110:checkstyleMain (complete), 
:streams:upgrade-system-tests-10:checkstyleMain (complete), 
:streams:upgrade-system-tests-11:checkstyleMain (complete), 
:streams:upgrade-system-tests-20:checkstyleMain (complete), 
:streams:upgrade-system-tests-21:checkstyleMain (complete), 
:streams:upgrade-system-tests-22:checkstyleMain (complete), 
:streams:upgrade-system-tests-23:checkstyleMain (complete), 
:streams:upgrade-system-tests-24:checkstyleMain (complete), 
:streams:upgrade-system-tests-25:checkstyleMain (complete), 
:streams:upgrade-system-tests-26:checkstyleMain (complete), 
:streams:upgrade-system-tests-27:checkstyleMain (complete), 
:streams:upgrade-system-tests-28:checkstyleMain (complete), 
:streams:upgrade-system-tests-30:checkstyleMain (complete), 
:streams:upgrade-system-tests-31:checkstyleMain (complete), 
:streams:upgrade-system-tests-32:checkstyleMain (complete), 
:streams:upgrade-system-tests-33:checkstyleMain (complete)]
  - group 7 entry nodes: [:clients:checkstyleTest (complete), 
:connect:checkstyleTest (complete), :core:checkstyleTest (complete), 
:examples:checkstyleTest (complete), :generator:checkstyleTest (complete), 
:group-coordinator:checkstyleTest (complete), :jmh-benchmarks:checkstyleTest 
(complete), :log4j-appender:checkstyleTest (complete), :metadata:checkstyleTest 
(complete), :raft:checkstyleTest (complete), :server-common:checkstyleTest 
(complete), :shell:checkstyleTest (complete), :storage:checkstyleTest 
(complete), :streams:checkstyleTest (complete), :tools:checkstyleTest 
(complete), :trogdor:checkstyleTest (complete), :connect:api:checkstyleTest 
(complete), :connect:basic-auth-extension:checkstyleTest (complete), 
:connect:file:checkstyleTest (complete), :connect:json:checkstyleTest 
(complete), :connect:mirror:checkstyleTest (complete), 
:connect:mirror-client:checkstyleTest (complete), 
:connect:runtime:checkstyleTest (complete), :connect:transforms:checkstyleTest 
(complete), :storage:api:checkstyleTest (complete), 
:streams:examples:checkstyleTest (complete), 
:streams:streams-scala:checkstyleTest (complete), 
:streams:test-utils:checkstyleTest (complete), 
:streams:upgrade-system-tests-0100:checkstyleTest (complete), 
:streams:upgrade-system-tests-0101:checkstyleTest (complete), 
:streams:upgrade-system-tests-0102:checkstyleTest (complete), 
:streams:upgrade-system-tests-0110:checkstyleTest (complete), 
:streams:upgrade-system-tests-10:checkstyleTest (complete), 
:streams:upgrade-system-tests-11:checkstyleTest (complete), 
:streams:upgrade-system-tests-20:checkstyleTest (complete), 
:streams:upgrade-system-tests-21:checkstyleTest (complete), 
:streams:upgrade-system-tests-22:checkstyleTest (complete), 
:streams:upgrade-system-tests-23:checkstyleTest (complete), 
:streams:upgrade-system-tests-24:checkstyleTest (complete), 
:streams:upgrade-system-tests-25:checkstyleTest (complete), 
:streams:upgrade-system-tests-26:checkstyleTest (complete), 
:streams:upgrade-system-tests-27:checkstyleTest (complete), 
:streams:upgrade-system-tests-28:checkstyleTest (complete), 
:streams:upgrade-system-tests-30:checkstyleTest (complete), 
:streams:upgrade-system-tests-31:checkstyleTest (complete), 
:streams:upgrade-system-tests-32:checkstyleTest (complete), 
:streams:upgrade-system-tests-33:checkstyleTest (complete)]
  - group 8 entry nodes: 

[jira] [Created] (KAFKA-15324) Do not bump the leader epoch when changing the replica set

2023-08-08 Thread zeyuliu (Jira)
zeyuliu created KAFKA-15324:
---

 Summary: Do not bump the leader epoch when changing the replica set
 Key: KAFKA-15324
 URL: https://issues.apache.org/jira/browse/KAFKA-15324
 Project: Kafka
  Issue Type: Task
Reporter: zeyuliu


The KRaft controller increases the leader epoch when a partition replica set 
shrinks. This is not strictly required and should be removed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #2078

2023-08-08 Thread Apache Jenkins Server
See 




Re: [QUESTION] about topic dimension quota

2023-08-08 Thread hudeqi
In fact, I have already implemented bytesIn/bytesOut limits at the topic dimension. 
I don't know the community's attitude towards this feature, so I'm not sure whether 
I need to propose a KIP to contribute it.

best,
hudeqi


 -Original Message-
 From: hudeqi 16120...@bjtu.edu.cn
 Sent: 2023-08-08 21:10:39 (Tuesday)
 To: dev@kafka.apache.org
 Cc: 
 Subject: [QUESTION] about topic dimension quota
 


Re: [DISCUSS] KIP-965: Support disaster recovery between clusters by MirrorMaker

2023-08-08 Thread hudeqi
Thanks for your suggestion, Ryanne. I have updated the configuration name in 
cwiki.

best,
hudeqi

Build failed in Jenkins: Kafka » Kafka Branch Builder » 3.4 #154

2023-08-08 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 440796 lines...]
Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testGetLogConfigs() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testGetLogConfigs() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testBrokerSequenceIdMethods() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testBrokerSequenceIdMethods() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testAclMethods() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testAclMethods() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testCreateSequentialPersistentPath() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testCreateSequentialPersistentPath() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testConditionalUpdatePath() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testConditionalUpdatePath() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testGetAllTopicsInClusterTriggersWatch() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testGetAllTopicsInClusterTriggersWatch() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testDeleteTopicZNode() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testDeleteTopicZNode() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testDeletePath() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testDeletePath() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testGetBrokerMethods() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testGetBrokerMethods() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testJuteMaxBufffer() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testJuteMaxBufffer() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testCreateTokenChangeNotification() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testCreateTokenChangeNotification() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testGetTopicsAndPartitions() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testGetTopicsAndPartitions() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testChroot(boolean) > 
kafka.zk.KafkaZkClientTest.testChroot(boolean)[1] STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testChroot(boolean) > 
kafka.zk.KafkaZkClientTest.testChroot(boolean)[1] PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testChroot(boolean) > 
kafka.zk.KafkaZkClientTest.testChroot(boolean)[2] STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testChroot(boolean) > 
kafka.zk.KafkaZkClientTest.testChroot(boolean)[2] PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testRegisterBrokerInfo() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testRegisterBrokerInfo() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testRetryRegisterBrokerInfo() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testRetryRegisterBrokerInfo() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testConsumerOffsetPath() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testConsumerOffsetPath() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testDeleteRecursiveWithControllerEpochVersionCheck() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testDeleteRecursiveWithControllerEpochVersionCheck() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 169 > 
KafkaZkClientTest > testTopicAssignments() STARTED

Gradle Test Run 

[jira] [Created] (KAFKA-15323) Possible thread leaks in ListOffsetsHandlerTest

2023-08-08 Thread Kirk True (Jira)
Kirk True created KAFKA-15323:
-

 Summary: Possible thread leaks in ListOffsetsHandlerTest
 Key: KAFKA-15323
 URL: https://issues.apache.org/jira/browse/KAFKA-15323
 Project: Kafka
  Issue Type: Bug
  Components: clients, consumer, unit tests
Reporter: Kirk True
Assignee: Kirk True


Relevant Logs for AbstractCoordinatorTest - 

{noformat}
Relevant Logs for AbstractCoordinatorTest - 
Thread: 284 - kafka-coordinator-heartbeat-thread | dummy-group
java.base@17.0.7/java.lang.Object.wait(Native Method)
java.base@17.0.7/java.lang.Object.wait(Object.java:338)

app//org.apache.kafka.clients.consumer.internals.AbstractCoordinator$HeartbeatThread.run(AbstractCoordinator.java:1448)
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15322) Possible thread leaks in AbstractCoordinatorTest

2023-08-08 Thread Kirk True (Jira)
Kirk True created KAFKA-15322:
-

 Summary: Possible thread leaks in AbstractCoordinatorTest
 Key: KAFKA-15322
 URL: https://issues.apache.org/jira/browse/KAFKA-15322
 Project: Kafka
  Issue Type: Bug
  Components: clients, consumer, unit tests
Reporter: Kirk True
Assignee: Kirk True


Relevant Logs for AbstractCoordinatorTest - 

 
{code:java}
Relevant Logs for AbstractCoordinatorTest - 
Thread: 284 - kafka-coordinator-heartbeat-thread | dummy-group
java.base@17.0.7/java.lang.Object.wait(Native Method)
java.base@17.0.7/java.lang.Object.wait(Object.java:338) 
app//org.apache.kafka.clients.consumer.internals.AbstractCoordinator$HeartbeatThread.run(AbstractCoordinator.java:1448)
 {code}
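
As a purely illustrative sketch (not the actual test code), a teardown check along 
these lines could catch such a leak, assuming the heartbeat thread name shown in the 
log above:

{code:java}
// Sketch only: fail the test if a coordinator heartbeat thread survives teardown.
import java.util.Set;
import org.junit.jupiter.api.AfterEach;
import static org.junit.jupiter.api.Assertions.assertFalse;

class HeartbeatThreadLeakCheck {
    @AfterEach
    void verifyNoLeakedHeartbeatThreads() {
        Set<Thread> live = Thread.getAllStackTraces().keySet();
        boolean leaked = live.stream()
                .anyMatch(t -> t.getName().contains("kafka-coordinator-heartbeat-thread"));
        assertFalse(leaked, "Heartbeat thread leaked; close the coordinator during teardown");
    }
}
{code}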
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #2077

2023-08-08 Thread Apache Jenkins Server
See 




[jira] [Resolved] (KAFKA-15312) FileRawSnapshotWriter must flush before atomic move

2023-08-08 Thread Jira


 [ 
https://issues.apache.org/jira/browse/KAFKA-15312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

José Armando García Sancio resolved KAFKA-15312.

Resolution: Fixed

> FileRawSnapshotWriter must flush before atomic move
> ---
>
> Key: KAFKA-15312
> URL: https://issues.apache.org/jira/browse/KAFKA-15312
> Project: Kafka
>  Issue Type: Bug
>  Components: kraft
>Reporter: José Armando García Sancio
>Assignee: José Armando García Sancio
>Priority: Major
> Fix For: 3.3.3, 3.6.0, 3.4.2, 3.5.2
>
>
> On ext4 file systems it is possible for KRaft to create zero-length snapshot 
> files. Not all file systems fsync to disk on close. For KRaft to guarantee 
> that the data has made it to disk before calling rename, it needs to make 
> sure that the file has been fsynced.
> We have seen cases where the snapshot file has zero-length data on ext4 file 
> systems.
> {quote} "Delayed allocation" means that the filesystem tries to delay the 
> allocation of physical disk blocks for written data for as long as possible. 
> This policy brings some important performance benefits. Many files are 
> short-lived; delayed allocation can keep the system from writing fleeting 
> temporary files to disk at all. And, for longer-lived files, delayed 
> allocation allows the kernel to accumulate more data and to allocate the 
> blocks for data contiguously, speeding up both the write and any subsequent 
> reads of that data. It's an important optimization which is found in most 
> contemporary filesystems.
> But, if blocks have not been allocated for a file, there is no need to write 
> them quickly as a security measure. Since the blocks do not yet exist, it is 
> not possible to read somebody else's data from them. So ext4 will not 
> (cannot) write out unallocated blocks as part of the next journal commit 
> cycle. Those blocks will, instead, wait until the kernel decides to flush 
> them out; at that point, physical blocks will be allocated on disk and the 
> data will be made persistent. The kernel doesn't like to let file data sit 
> unwritten for too long, but it can still take a minute or so (with the 
> default settings) for that data to be flushed - far longer than the five 
> seconds normally seen with ext3. And that is why a crash can cause the loss 
> of quite a bit more data when ext4 is being used. 
> {quote}
> from: [https://lwn.net/Articles/322823/]
> {quote}auto_da_alloc ( * ), noauto_da_alloc
> Many broken applications don't use fsync() when replacing existing files via 
> patterns such as fd = open("foo.new")/write(fd,..)/close(fd)/ 
> rename("foo.new", "foo"), or worse yet, fd = open("foo", 
> O_TRUNC)/write(fd,..)/close(fd). If auto_da_alloc is enabled, ext4 will 
> detect the replace-via-rename and replace-via-truncate patterns and force 
> that any delayed allocation blocks are allocated such that at the next 
> journal commit, in the default data=ordered mode, the data blocks of the new 
> file are forced to disk before the rename() operation is committed. This 
> provides roughly the same level of guarantees as ext3, and avoids the 
> "zero-length" problem that can happen when a system crashes before the 
> delayed allocation blocks are forced to disk.
> {quote}
> from: [https://www.kernel.org/doc/html/latest/admin-guide/ext4.html]
>  
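
For illustration only, here is a minimal sketch of the flush-before-rename pattern 
the fix describes (this is not the actual FileRawSnapshotWriter code; the class and 
method names are placeholders):

{code:java}
// Sketch: force the temp file (and, on POSIX systems, its parent directory) to disk
// before the atomic rename, so a crash cannot leave a zero-length snapshot behind.
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public final class FlushThenAtomicMove {
    public static void flushAndMove(Path tmpSnapshot, Path finalSnapshot) throws IOException {
        try (FileChannel ch = FileChannel.open(tmpSnapshot, StandardOpenOption.WRITE)) {
            ch.force(true); // fsync the file's data and metadata
        }
        Files.move(tmpSnapshot, finalSnapshot, StandardCopyOption.ATOMIC_MOVE);
        try (FileChannel dir = FileChannel.open(finalSnapshot.getParent(), StandardOpenOption.READ)) {
            dir.force(true); // make the rename itself durable (POSIX file systems)
        }
    }
}
{code}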



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Build failed in Jenkins: Kafka » Kafka Branch Builder » 3.5 #54

2023-08-08 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 189332 lines...]
Gradle Test Run :core:integrationTest > Gradle Test Executor 180 > 
ZooKeeperClientTest > testZooKeeperStateChangeRateMetrics() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 180 > 
ZooKeeperClientTest > testZNodeChangeHandlerForDeletion() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 180 > 
ZooKeeperClientTest > testZNodeChangeHandlerForDeletion() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 180 > 
ZooKeeperClientTest > testGetAclNonExistentZNode() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 180 > 
ZooKeeperClientTest > testGetAclNonExistentZNode() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 180 > 
ZooKeeperClientTest > testStateChangeHandlerForAuthFailure() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 180 > 
ZooKeeperClientTest > testStateChangeHandlerForAuthFailure() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 186 > 
PlaintextAdminIntegrationTest > testOffsetsForTimesAfterDeleteRecords(String) > 
kafka.api.PlaintextAdminIntegrationTest.testOffsetsForTimesAfterDeleteRecords(String)[1]
 STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 185 > 
FetchFromFollowerIntegrationTest > testRackAwareRangeAssignor() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 186 > 
PlaintextAdminIntegrationTest > testOffsetsForTimesAfterDeleteRecords(String) > 
kafka.api.PlaintextAdminIntegrationTest.testOffsetsForTimesAfterDeleteRecords(String)[1]
 PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 186 > 
PlaintextAdminIntegrationTest > testOffsetsForTimesAfterDeleteRecords(String) > 
kafka.api.PlaintextAdminIntegrationTest.testOffsetsForTimesAfterDeleteRecords(String)[2]
 STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 185 > 
FetchFromFollowerIntegrationTest > testRackAwareRangeAssignor() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 186 > 
PlaintextAdminIntegrationTest > testOffsetsForTimesAfterDeleteRecords(String) > 
kafka.api.PlaintextAdminIntegrationTest.testOffsetsForTimesAfterDeleteRecords(String)[2]
 PASSED

2027 tests completed, 2 failed, 4 skipped
There were failing tests. See the report at: 
file:///home/jenkins/workspace/Kafka_kafka_3.5/core/build/reports/tests/integrationTest/index.html

> Task :streams:integrationTest

Gradle Test Run :streams:integrationTest > Gradle Test Executor 183 > 
StreamsAssignmentScaleTest > testStickyTaskAssignorManyStandbys PASSED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 183 > 
StreamsAssignmentScaleTest > testStickyTaskAssignorManyThreadsPerClient STARTED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 183 > 
StreamsAssignmentScaleTest > testStickyTaskAssignorManyThreadsPerClient PASSED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 183 > 
StreamsAssignmentScaleTest > testFallbackPriorTaskAssignorManyThreadsPerClient 
STARTED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 183 > 
StreamsAssignmentScaleTest > testFallbackPriorTaskAssignorManyThreadsPerClient 
PASSED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 183 > 
StreamsAssignmentScaleTest > testFallbackPriorTaskAssignorLargePartitionCount 
STARTED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 183 > 
StreamsAssignmentScaleTest > testFallbackPriorTaskAssignorLargePartitionCount 
PASSED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 183 > 
StreamsAssignmentScaleTest > testStickyTaskAssignorLargePartitionCount STARTED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 183 > 
StreamsAssignmentScaleTest > testStickyTaskAssignorLargePartitionCount PASSED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 183 > 
StreamsAssignmentScaleTest > testFallbackPriorTaskAssignorManyStandbys STARTED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 183 > 
StreamsAssignmentScaleTest > testFallbackPriorTaskAssignorManyStandbys PASSED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 183 > 
StreamsAssignmentScaleTest > testHighAvailabilityTaskAssignorManyStandbys 
STARTED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 183 > 
StreamsAssignmentScaleTest > testHighAvailabilityTaskAssignorManyStandbys PASSED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 183 > 
StreamsAssignmentScaleTest > testFallbackPriorTaskAssignorLargeNumConsumers 
STARTED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 183 > 
StreamsAssignmentScaleTest > testFallbackPriorTaskAssignorLargeNumConsumers 
PASSED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 183 > 

[jira] [Created] (KAFKA-15321) Document consumer group member state machine

2023-08-08 Thread Kirk True (Jira)
Kirk True created KAFKA-15321:
-

 Summary: Document consumer group member state machine
 Key: KAFKA-15321
 URL: https://issues.apache.org/jira/browse/KAFKA-15321
 Project: Kafka
  Issue Type: Sub-task
Reporter: Kirk True
Assignee: Kirk True


We need to first document the new consumer group member state machine. What are 
the different states and what are the transitions?

See [~pnee]'s notes: 
[https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design]

*_Don’t forget to include diagrams for clarity!_*

This should be documented on the AK wiki.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15320) Document event queueing patterns

2023-08-08 Thread Kirk True (Jira)
Kirk True created KAFKA-15320:
-

 Summary: Document event queueing patterns
 Key: KAFKA-15320
 URL: https://issues.apache.org/jira/browse/KAFKA-15320
 Project: Kafka
  Issue Type: Sub-task
Reporter: Kirk True
Assignee: Kirk True


We need to first document the event enqueuing patterns in the 
PrototypeAsyncConsumer. As part of this task, determine if it’s 
necessary/beneficial to _conditionally_ add events and/or coalesce any 
duplicate events in the queue.

_Don’t forget to include diagrams for clarity!_

This should be documented on the AK wiki.
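
As a purely illustrative sketch of the kind of coalescing the task asks us to 
evaluate (this is not the PrototypeAsyncConsumer's actual design; the key and event 
types are placeholders):

{code:java}
// Sketch: keep only the newest event per key, so duplicate events are coalesced.
import java.util.LinkedHashMap;
import java.util.Map;

final class CoalescingEventQueue<K, E> {
    private final Map<K, E> pending = new LinkedHashMap<>();

    synchronized void add(K key, E event) {
        pending.put(key, event); // a later event with the same key replaces the earlier one
    }

    synchronized Map<K, E> drain() {
        Map<K, E> drained = new LinkedHashMap<>(pending);
        pending.clear();
        return drained;
    }
}
{code}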



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15319) Upgrade rocksdb to fix CVE-2022-37434

2023-08-08 Thread Maruthi (Jira)
Maruthi created KAFKA-15319:
---

 Summary: Upgrade rocksdb to fix CVE-2022-37434
 Key: KAFKA-15319
 URL: https://issues.apache.org/jira/browse/KAFKA-15319
 Project: Kafka
  Issue Type: Bug
Affects Versions: 3.4.1
Reporter: Maruthi


Rocksdbjni<7.9.2 is vulnerable to CVE-2022-37434 due to zlib 1.2.12

Upgrade the bundled zlib to 1.2.13 to fix this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Apache Kafka 3.4.1 release

2023-08-08 Thread José Armando García Sancio
Hey Luke,

Thanks for working on the release for 3.4.1. I was working on some
cherry picks and I noticed that branch 3.4 doesn't contain the
commit/tag for 3.4.1. I think we are supposed to merge the tag back to
the 3.4 branch. E.g.:

> Merge the last version change / rc tag into the release branch and bump the 
> version to 0.10.0.1-SNAPSHOT
>
> git checkout 0.10.0
> git merge 0.10.0.0-rc6

from: https://cwiki.apache.org/confluence/display/KAFKA/Release+Process

Did we forget to do that part?

Thanks!
-- 
-José


Re: Flaky tests need attention (clients, Connect, Mirror Maker, Streams, etc.)

2023-08-08 Thread Greg Harris
Hey all,

MirrorConnectorsIntegration*Test#testOffsetTranslationBehindReplicationFlow()
flakiness should be addressed by
https://github.com/apache/kafka/pull/14156 which is currently in
review. There is other flakiness in the suite which this PR does not
address and needs further investigation.

The MirrorConnectorsIntegration*Test suites certainly need some
optimization. I've noticed that even in local test environments the
tear-down can be quite lengthy, and we haven't figured out the cause
yet. Each suite has >=11 tests, and there are 6 suites, each contributing 
15-20m to the test execution time.

Thanks,
Greg

On Tue, Aug 8, 2023 at 3:23 PM Kirk True  wrote:
>
> Here are some useful links for filtering failing tests by branches:
>
> All branches: 
> https://ge.apache.org/scans/tests?search.rootProjectNames=kafka=P28D
>
> trunk-only: 
> https://ge.apache.org/scans/tests?search.rootProjectNames=kafka=P28D=Git%20branch=trunk
>
> All branches except trunk: 
> https://ge.apache.org/scans/tests?search.rootProjectNames=kafka=P28D=not:Git%20branch=trunk
>
> Thanks,
> Kirk
>
> > On Aug 8, 2023, at 3:18 PM, Justine Olshan  
> > wrote:
> >
> > Thanks Kirk!
> > I will try to go through in the next day or so and see if there is any
> > tests I can fix.
> >
> > On Tue, Aug 8, 2023 at 3:13 PM Kirk True  wrote:
> >
> >> Hi Justine,
> >>
> >>> On Aug 1, 2023, at 4:50 PM, Justine Olshan 
> >> wrote:
> >>>
> >>> Is that right that the first one on the list (
> >>>
> >> org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationExactlyOnceTest)
> >>> takes
> >>> 20 minutes?! That's quite a test.
> >>> I wonder if the length corresponds to whether it passes, but we should
> >> fix
> >>> it and maybe move it out of our PR builds.
> >>
> >> It certainly does seem like it is taking that long PLUS any retries since
> >> it’s flaky.
> >>
> >>> I was also wondering if we could distinguish PR builds from trunk builds.
> >>> That might give us a better signal since PR builds could be before tests
> >>> are fixed. Not sure which one is being reported here.
> >>
> >> It includes both trunk and PR branch test runs. I’ll see how easy it is to
> >> filter them out.
> >>
> >> Thanks!
> >>
> >>>
> >>> Thanks for sharing though! This is a useful tool that we've needed for a
> >>> while.
> >>>
> >>> Justine
> >>>
> >>> On Tue, Aug 1, 2023 at 4:38 PM Kirk True  wrote:
> >>>
>  Hi!
> 
>  According to the Gradle Enterprise statistics on our recent Kafka
> >> builds,
>  over 90% have flaky tests [1].
> 
>  We also have 106 open Jiras with the “flaky-test” label across several
>  functional areas of the project [2].
> 
>  Can I ask that those familiar with those different functional areas
> >> take a
>  look at the list of flaky tests and triage them?
> 
>  Thanks,
>  Kirk
> 
>  [1]
> 
> >> https://ge.apache.org/scans/tests?search.relativeStartTime=P28D=kafka
>  [2]
> 
> >> https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22)%20AND%20labels%20%3D%20flaky-test
> >>
> >>
>


Re: Flaky tests need attention (clients, Connect, Mirror Maker, Streams, etc.)

2023-08-08 Thread Kirk True
Here are some useful links for filtering failing tests by branches:

All branches: 
https://ge.apache.org/scans/tests?search.rootProjectNames=kafka=P28D

trunk-only: 
https://ge.apache.org/scans/tests?search.rootProjectNames=kafka=P28D=Git%20branch=trunk

All branches except trunk: 
https://ge.apache.org/scans/tests?search.rootProjectNames=kafka=P28D=not:Git%20branch=trunk

Thanks,
Kirk

> On Aug 8, 2023, at 3:18 PM, Justine Olshan  
> wrote:
> 
> Thanks Kirk!
> I will try to go through in the next day or so and see if there is any
> tests I can fix.
> 
> On Tue, Aug 8, 2023 at 3:13 PM Kirk True  wrote:
> 
>> Hi Justine,
>> 
>>> On Aug 1, 2023, at 4:50 PM, Justine Olshan 
>> wrote:
>>> 
>>> Is that right that the first one on the list (
>>> 
>> org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationExactlyOnceTest)
>>> takes
>>> 20 minutes?! That's quite a test.
>>> I wonder if the length corresponds to whether it passes, but we should
>> fix
>>> it and maybe move it out of our PR builds.
>> 
>> It certainly does seem like it is taking that long PLUS any retries since
>> it’s flaky.
>> 
>>> I was also wondering if we could distinguish PR builds from trunk builds.
>>> That might give us a better signal since PR builds could be before tests
>>> are fixed. Not sure which one is being reported here.
>> 
>> It includes both trunk and PR branch test runs. I’ll see how easy it is to
>> filter them out.
>> 
>> Thanks!
>> 
>>> 
>>> Thanks for sharing though! This is a useful tool that we've needed for a
>>> while.
>>> 
>>> Justine
>>> 
>>> On Tue, Aug 1, 2023 at 4:38 PM Kirk True  wrote:
>>> 
 Hi!
 
 According to the Gradle Enterprise statistics on our recent Kafka
>> builds,
 over 90% have flaky tests [1].
 
 We also have 106 open Jiras with the “flaky-test” label across several
 functional areas of the project [2].
 
 Can I ask that those familiar with those different functional areas
>> take a
 look at the list of flaky tests and triage them?
 
 Thanks,
 Kirk
 
 [1]
 
>> https://ge.apache.org/scans/tests?search.relativeStartTime=P28D=kafka
 [2]
 
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22)%20AND%20labels%20%3D%20flaky-test
>> 
>> 



Re: Flaky tests need attention (clients, Connect, Mirror Maker, Streams, etc.)

2023-08-08 Thread Justine Olshan
Thanks Kirk!
I will try to go through in the next day or so and see if there is any
tests I can fix.

On Tue, Aug 8, 2023 at 3:13 PM Kirk True  wrote:

> Hi Justine,
>
> > On Aug 1, 2023, at 4:50 PM, Justine Olshan 
> wrote:
> >
> > Is that right that the first one on the list (
> >
> org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationExactlyOnceTest)
> > takes
> > 20 minutes?! That's quite a test.
> > I wonder if the length corresponds to whether it passes, but we should
> fix
> > it and maybe move it out of our PR builds.
>
> It certainly does seem like it is taking that long PLUS any retries since
> it’s flaky.
>
> > I was also wondering if we could distinguish PR builds from trunk builds.
> > That might give us a better signal since PR builds could be before tests
> > are fixed. Not sure which one is being reported here.
>
> It includes both trunk and PR branch test runs. I’ll see how easy it is to
> filter them out.
>
> Thanks!
>
> >
> > Thanks for sharing though! This is a useful tool that we've needed for a
> > while.
> >
> > Justine
> >
> > On Tue, Aug 1, 2023 at 4:38 PM Kirk True  wrote:
> >
> >> Hi!
> >>
> >> According to the Gradle Enterprise statistics on our recent Kafka
> builds,
> >> over 90% have flaky tests [1].
> >>
> >> We also have 106 open Jiras with the “flaky-test” label across several
> >> functional areas of the project [2].
> >>
> >> Can I ask that those familiar with those different functional areas
> take a
> >> look at the list of flaky tests and triage them?
> >>
> >> Thanks,
> >> Kirk
> >>
> >> [1]
> >>
> https://ge.apache.org/scans/tests?search.relativeStartTime=P28D=kafka
> >> [2]
> >>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22)%20AND%20labels%20%3D%20flaky-test
>
>


Re: Flaky tests need attention (clients, Connect, Mirror Maker, Streams, etc.)

2023-08-08 Thread Kirk True
Hi Justine,

> On Aug 1, 2023, at 4:50 PM, Justine Olshan  
> wrote:
> 
> Is that right that the first one on the list (
> org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationExactlyOnceTest)
> takes
> 20 minutes?! That's quite a test.
> I wonder if the length corresponds to whether it passes, but we should fix
> it and maybe move it out of our PR builds.

It certainly does seem like it is taking that long PLUS any retries since it’s 
flaky. 

> I was also wondering if we could distinguish PR builds from trunk builds.
> That might give us a better signal since PR builds could be before tests
> are fixed. Not sure which one is being reported here.

It includes both trunk and PR branch test runs. I’ll see how easy it is to 
filter them out.

Thanks!

> 
> Thanks for sharing though! This is a useful tool that we've needed for a
> while.
> 
> Justine
> 
> On Tue, Aug 1, 2023 at 4:38 PM Kirk True  wrote:
> 
>> Hi!
>> 
>> According to the Gradle Enterprise statistics on our recent Kafka builds,
>> over 90% have flaky tests [1].
>> 
>> We also have 106 open Jiras with the “flaky-test” label across several
>> functional areas of the project [2].
>> 
>> Can I ask that those familiar with those different functional areas take a
>> look at the list of flaky tests and triage them?
>> 
>> Thanks,
>> Kirk
>> 
>> [1]
>> https://ge.apache.org/scans/tests?search.relativeStartTime=P28D=kafka
>> [2]
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22)%20AND%20labels%20%3D%20flaky-test



Re: Debugging Jenkins test failures

2023-08-08 Thread Kirk True
I created this INFRA Jira just now to see if they can help resolve some of the 
intermittent Jenkins build issues:

https://issues.apache.org/jira/browse/INFRA-24874


> On Aug 1, 2023, at 4:04 PM, Kirk True  wrote:
> 
> Hi Divij,
> 
> Thanks for the pointer to Gradle Enterprise! That’s exactly what I was 
> looking for.
> 
> Did we track builds before July 12? I see only tiny blips of failures on the 
> 90-day view.
> 
> Thanks,
> Kirk
> 
>> On Jul 26, 2023, at 2:08 AM, Divij Vaidya  wrote:
>> 
>> Hi Kirk
>> 
>> I have been using this new tool to analyze the trends of test
>> failures: 
>> https://ge.apache.org/scans/tests?search.relativeStartTime=P28D=kafka=Europe/Berlin
>> and general build failures:
>> https://ge.apache.org/scans/failures?search.relativeStartTime=P28D=kafka=Europe/Berlin
>> 
>> About the classes of build failure, if we look at the last 28 days, I
>> do not observe an increasing trend. The top causes of failure are:
>> (link [2])
>> 1. Failures due to checkstyle (193 builds)
>> 2. Timeout waiting to lock cache. It is currently in-use by another
>> Gradle instance.
>> 3. Compilation failures (116 builds)
>> 4. "Gradle Test Executor" finished with a non-zero exit value. Process
>> 'Gradle Test Executor 180' finished with non-zero exit value 1
>> 
>> #4 is caused by a test failure that causes a crash of the Gradle
>> process. To debug this, I usually go to complete test output and try
>> to figure out which was the last test that 'Gradle Test Executor 180'
>> was running. As an example, consider
>> https://ge.apache.org/s/luizhogirob4e. We observe that this fails for
>> PR-14094. Now, we need to see the complete system out. To find that, I
>> will go to Kafka PR builder at
>> https://ci-builds.apache.org/job/Kafka/job/kafka-pr/view/change-requests/
>> and find the build page for PR-14094. That page is
>> https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-14094/.
>> Next, find last failed build at
>> https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-14094/lastFailedBuild/
>> , observe that we have a failure for "Gradle Test Executor 177", click
>> on view as plain text (it takes a long time to load), find what the
>> GradleTest Executor was doing. In this case, it failed with the
>> following error. I strongly believe that it is due to
>> https://github.com/apache/kafka/pull/13572 but unfortunately, this was
>> reverted and never fixed after that. Perhaps you might want to re
>> 
>> Gradle Test Run :core:integrationTest > Gradle Test Executor 177 >
>> ProducerFailureHandlingTest > testTooLargeRecordWithAckZero() STARTED
>> 
>>> Task :clients:integrationTest FAILED
>> org.gradle.internal.remote.internal.ConnectException: Could not
>> connect to server [bd7b0504-7491-43f8-a716-513adb302c92 port:43321,
>> addresses:[/127.0.0.1]]. Tried addresses: [/127.0.0.1].
>> at 
>> org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:67)
>> at 
>> org.gradle.internal.remote.internal.hub.MessageHubBackedClient.getConnection(MessageHubBackedClient.java:36)
>> at 
>> org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:103)
>> at 
>> org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:65)
>> at 
>> worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
>> at 
>> worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
>> Caused by: java.net.ConnectException: Connection refused
>> at java.base/sun.nio.ch.Net.pollConnect(Native Method)
>> at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
>> at 
>> java.base/sun.nio.ch.SocketChannelImpl.finishTimedConnect(SocketChannelImpl.java:1141)
>> at 
>> java.base/sun.nio.ch.SocketChannelImpl.blockingConnect(SocketChannelImpl.java:1183)
>> at java.base/sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:98)
>> at 
>> org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.tryConnect(TcpOutgoingConnector.java:81)
>> at 
>> org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:54)
>> ... 5 more
>> 
>> 
>> 
>> 
>> About the classes of test failure problems, if we look at the last 28
>> days, the following tests are the biggest culprits. If we fix just
>> these two, our CI would be in a much better shape. (link [1])
>> 1. https://issues.apache.org/jira/browse/KAFKA-15197 (this test passes
>> only 53% of the time)
>> 2. https://issues.apache.org/jira/browse/KAFKA-15052 (this test passes
>> only 49% of the time)
>> 
>> 
>> [1] 
>> https://ge.apache.org/scans/tests?search.relativeStartTime=P28D=kafka=Europe/Berlin
>> [2] 
>> https://ge.apache.org/scans/failures?search.relativeStartTime=P28D=kafka=Europe/Berlin
>> 
>> 
>> --
>> Divij Vaidya
>> 
>> On Tue, Jul 25, 2023 at 8:09 PM Kirk True  wrote:
>>> 
>>> Hi all!
>>> 
>>> I’ve noticed that we’re back in the state 

Jenkins build is unstable: Kafka » Kafka Branch Builder » trunk #2076

2023-08-08 Thread Apache Jenkins Server
See 




Re: [DISCUSS] KIP-960: Support interactive queries (IQv2) for versioned state stores

2023-08-08 Thread Lucas Brutschy
Hi Alieh,

thanks a lot for the KIP. IQ with time semantics is going to be
another great improvement towards having crystal clear streaming
semantics!

1. I agree with Bruno and Matthias, to remove the 'bound' term for the
timestamps. It's confusing that we have bounds for both timestamps and
keys. In particular, `withNoBoundWithTimestampBound` seems to
contradict itself.

2. I would personally prefer having composable construction of the
query, instead of defining a separate method for each combination. So
for example:
- `keyRangeLatestValue(l,u)` ->  `withBounds(l, u).latest()`
- `withNoBoundsWithTimestampRange(t1,t2)` ->
`withNoBounds().fromTime(t1).untilTime(t2)`
- etc.pp.
This would have the advantage that the interface would be very
similar to `RangeQuery` and we'd need a lot fewer methods, so it will
make the API reference a much quicker read. We already use this style
to define `skipCache` in `KeyQuery`. I guess that diverges quite a bit
from the current proposal, but I'll leave it here anyways for you to
consider it (even if you decide to stick with the current model).
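
A hypothetical sketch of what that composable style could look like (the method
names here are placeholders, not a final API proposal):

    // hypothetical, for illustration of the composable style only
    VersionedRangeQuery<String, Long> query =
        VersionedRangeQuery.<String, Long>withBounds("k1", "k9")
            .fromTime(Instant.ofEpochMilli(1000L))
            .untilTime(Instant.ofEpochMilli(2000L));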

4. Please make sure to specify in every range-based method whether the
bounds are inclusive or exclusive. I see it being mentioned for some
methods, but for others, this is omitted. As I understand, 'until' is
usually used to mean exclusive, and 'from' is usually used to mean
inclusive, but it's better to specify this in the javadoc.

5. Similarly, as Matthias says, specify what happens if the "validity
range" of a value overlaps with the query range. So, to clarify his
remark, what happens if the value v1 is inserted at time 1 and value
v2 is inserted at time 3, and I query for the range `[2,4]` - will the
result include v1 or not? It's the valid value at time 2. For
inspiration, in `WindowRangeQuery`, this important semantic detail is
even clear from the method name `withWindowStartRange`.

6. For iterators, it is convention to call the method `peek` and this
convention is followed by e.g. `AbstractIterator` in Kafka, but also
Guava, Apache Commons etc. So I would also call it `peek`, not
`peekNextValue` here. It's clear what we are peeking at.

Cheers,
Lucas

On Thu, Jul 27, 2023 at 3:07 PM Alieh Saeedi
 wrote:
>
> Thanks, Bruno, for the feedback.
>
>
>- I agree with both points 2 and 3. About 3: Having "VersionsQualifier"
>reduces the number of methods and makes everything less confusing. In the
>end, that will be easier for developers to use.
>- About point 4: I renamed all the properties and parameters from
>"asOfTimestamp" to "fromTimestamp". That was my misunderstanding. So Now we
>have these two timestamp bounds: "fromTimestamp" and "untilTimestamp".
>- About point 5: Do we need system tests here? I assumed just
>integration tests were enough.
>- Regarding long vs Instant: I think yes, that's why I used
>Long as the timestamp.
>
> Bests,
> Alieh
>
>
>
>
>
>
> On Thu, Jul 27, 2023 at 2:28 PM Bruno Cadonna  wrote:
>
> > Hi Alieh,
> >
> > Thanks for the KIP!
> >
> >
> > Here my feedback.
> >
> > 1.
> > You can remove the private fields and constructors from the KIP. Those
> > are implementation details.
> >
> >
> > 2.
> > Some proposals for renamings
> >
> > in VersionedKeyQuery
> >
> > withKeyWithTimestampBound()
> >-> withKeyAndAsOf()
> >
> > withKeyWithTimestampRange()
> >-> withKeyAndTimeRange()
> >
> > in VersionedRangeQuery
> >
> > KeyRangeWithTimestampBound()
> >-> withKeyRangeAndAsOf()
> >
> > withLowerBoundWithTimestampBound()
> >-> withLowerBoundAndAsOf()
> >
> > withUpperBoundWithTimestampBound()
> >-> withUpperBoundAndAsOf()
> >
> > withNoBoundWithTimestampBound()
> >-> withNoBoundsAndAsOf
> >
> > keyRangeWithTimestampRange()
> >-> withKeyRangeAndTimeRange()
> >
> > withLowerBoundWithTimestampRange()
> >-> withLowerBoundAndTimeRange()
> >
> > withUpperBoundWithTimestampRange()
> >-> withUpperBounfAndTimeRange()
> >
> > withNoBoundWithTimestampRange()
> >-> withNoBoundsAndTimeRange()
> >
> >
> > 3.
> > Would it make sense to merge
> > withKeyLatestValue(final K key)
> > and
> > withKeyAllVersions(final K key)
> > into
> > withKey(final K key, final VersionsQualifier versionsQualifier)
> > where VersionsQualifier is an enum with values (ALL, LATEST). We could
> > also add EARLIEST if we feel it might be useful.
> > Same applies to all methods that end in LatestValue or AllVersions
> >
> >
> > 4.
> > I think getAsOfTimestamp() should not return the lower bound. If I query
> > a version as of a timestamp then the query should return the latest
> > version less than the timestamp.
> > I propose to rename the getters to getTimeFrom() and getTimeTo() as in
> > WindowRangeQuery.
> >
> >
> > 5.
> > Please add the Test Plan section.
> >
> >
> > Regarding long vs Instant: Did we miss using Instant instead of long
> > for all interfaces of the versioned state stores?
> >
> >
> > Best,
> > Bruno
> >
> >
> >
> >
> >
> >
> >
> >
> > On 

[jira] [Created] (KAFKA-15318) Move Acl publishing outside the QuorumController

2023-08-08 Thread Colin McCabe (Jira)
Colin McCabe created KAFKA-15318:


 Summary: Move Acl publishing outside the QuorumController
 Key: KAFKA-15318
 URL: https://issues.apache.org/jira/browse/KAFKA-15318
 Project: Kafka
  Issue Type: Bug
Reporter: Colin McCabe


On the controller, move Acl publishing into a dedicated MetadataPublisher, 
AclPublisher. This publisher listens for notifications from MetadataLoader, and 
receives only committed data. This brings the controller side in line with how 
the broker has always worked. It also avoids some ugly code related to 
publishing directly from the QuorumController. Most important of all, it clears 
the way to implement metadata transactions without worrying about Authorizer 
state (since it will be handled by the MetadataLoader, along with other 
metadata image state).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15317) Fix for async consumer access to committed offsets with multiple consumers

2023-08-08 Thread Lianet Magrans (Jira)
Lianet Magrans created KAFKA-15317:
--

 Summary: Fix for async consumer access to committed offsets with 
multiple consumers
 Key: KAFKA-15317
 URL: https://issues.apache.org/jira/browse/KAFKA-15317
 Project: Kafka
  Issue Type: Task
  Components: clients, consumer
Reporter: Lianet Magrans


Access to the committed offsets via a call to the _committed_ API function works as 
expected for a single async consumer, but it sometimes fails with a timeout when 
trying to retrieve the committed offsets with another consumer in the same 
group.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Build failed in Jenkins: Kafka » Kafka Branch Builder » trunk #2075

2023-08-08 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 385714 lines...]

Gradle Test Run :core:integrationTest > Gradle Test Executor 192 > 
SaslOAuthBearerSslEndToEndAuthorizationTest > 
testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl(String, boolean) > [4] 
quorum=zk, isIdempotenceEnabled=false STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 192 > 
SaslOAuthBearerSslEndToEndAuthorizationTest > 
testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl(String, boolean) > [4] 
quorum=zk, isIdempotenceEnabled=false PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 192 > 
TransactionsTest > testFailureToFenceEpoch(String) > 
testFailureToFenceEpoch(String).quorum=zk STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 191 > 
SslAdminIntegrationTest > 
testAsynchronousAuthorizerAclUpdatesDontBlockRequestThreads() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 191 > 
SslAdminIntegrationTest > 
testSynchronousAuthorizerAclUpdatesBlockRequestThreads() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 191 > 
SslAdminIntegrationTest > 
testSynchronousAuthorizerAclUpdatesBlockRequestThreads() PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 192 > 
TransactionsTest > testFailureToFenceEpoch(String) > 
testFailureToFenceEpoch(String).quorum=zk PASSED

Gradle Test Run :core:integrationTest > Gradle Test Executor 192 > 
TransactionsTest > testFailureToFenceEpoch(String) > 
testFailureToFenceEpoch(String).quorum=kraft STARTED

Gradle Test Run :connect:mirror:integrationTest > Gradle Test Executor 128 > 
MirrorConnectorsIntegrationTransactionsTest > 
testOffsetTranslationBehindReplicationFlow() PASSED

Gradle Test Run :connect:mirror:integrationTest > Gradle Test Executor 128 > 
MirrorConnectorsIntegrationTransactionsTest > 
testNoCheckpointsIfNoRecordsAreMirrored() STARTED

Gradle Test Run :core:integrationTest > Gradle Test Executor 192 > 
TransactionsTest > testFailureToFenceEpoch(String) > 
testFailureToFenceEpoch(String).quorum=kraft PASSED

2039 tests completed, 8 failed, 4 skipped
There were failing tests. See the report at: 
file:///home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk/core/build/reports/tests/integrationTest/index.html

Deprecated Gradle features were used in this build, making it incompatible with 
Gradle 9.0.

You can use '--warning-mode all' to show the individual deprecation warnings 
and determine if they come from your own scripts or plugins.

For more on this, please refer to 
https://docs.gradle.org/8.2.1/userguide/command_line_interface.html#sec:command_line_warnings
 in the Gradle documentation.

BUILD SUCCESSFUL in 3h 48m 41s
231 actionable tasks: 124 executed, 107 up-to-date

Publishing build scan...
https://ge.apache.org/s/2hafwhveieh5g


See the profiling report at: 
file:///home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk/build/reports/profile/profile-2023-08-08-15-10-30.html
A fine-grained performance profile is available: use the --scan option.
[Pipeline] junit
Recording test results

Gradle Test Run :connect:mirror:integrationTest > Gradle Test Executor 128 > 
MirrorConnectorsIntegrationTransactionsTest > 
testNoCheckpointsIfNoRecordsAreMirrored() PASSED

Gradle Test Run :connect:mirror:integrationTest > Gradle Test Executor 128 > 
MirrorConnectorsIntegrationTransactionsTest > 
testOneWayReplicationWithAutoOffsetSync() STARTED

Gradle Test Run :connect:mirror:integrationTest > Gradle Test Executor 129 > 
MirrorConnectorsWithCustomForwardingAdminIntegrationTest > 
testReplicationWithEmptyPartition() PASSED

Gradle Test Run :connect:mirror:integrationTest > Gradle Test Executor 129 > 
MirrorConnectorsWithCustomForwardingAdminIntegrationTest > 
testReplicateSourceDefault() STARTED
[Checks API] No suitable checks publisher found.
[Pipeline] echo
Skipping Kafka Streams archetype test for Java 17
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] }
[Pipeline] // node
[Pipeline] }
[Pipeline] // timestamps
[Pipeline] }
[Pipeline] // timeout
[Pipeline] }
[Pipeline] // stage
[Pipeline] }

Gradle Test Run :connect:mirror:integrationTest > Gradle Test Executor 129 > 
MirrorConnectorsWithCustomForwardingAdminIntegrationTest > 
testReplicateSourceDefault() PASSED

Gradle Test Run :connect:mirror:integrationTest > Gradle Test Executor 129 > 
MirrorConnectorsWithCustomForwardingAdminIntegrationTest > 
testOffsetSyncsTopicsOnTarget() STARTED

Gradle Test Run :connect:mirror:integrationTest > Gradle Test Executor 128 > 
MirrorConnectorsIntegrationTransactionsTest > 
testOneWayReplicationWithAutoOffsetSync() PASSED

Gradle Test Run :connect:mirror:integrationTest > Gradle Test Executor 128 > 
MirrorConnectorsIntegrationTransactionsTest > testSyncTopicConfigs() STARTED

Gradle Test Run 

[jira] [Resolved] (KAFKA-15126) Change range queries to accept null lower and upper bounds

2023-08-08 Thread Bill Bejeck (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bejeck resolved KAFKA-15126.
-
Resolution: Fixed

[Merged to trunk|https://github.com/apache/kafka/pull/14137]

> Change range queries to accept null lower and upper bounds
> --
>
> Key: KAFKA-15126
> URL: https://issues.apache.org/jira/browse/KAFKA-15126
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Lucia Cerchie
>Assignee: Lucia Cerchie
>Priority: Minor
> Fix For: 3.6.0
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> When web client requests come in with query params, it's common for those 
> params to be null. We want developers to just be able to pass in the 
> upper/lower bounds if they want, instead of implementing their own logic to 
> avoid getting the whole range (which will happen if they leave the params null).
> An example of the logic they can avoid using after this KIP is implemented is below:
> {code:java}
> private RangeQuery> 
> createRangeQuery(String lower, String upper) {
> if (isBlank(lower) && isBlank(upper)) {
> return RangeQuery.withNoBounds();
> } else if (!isBlank(lower) && isBlank(upper)) {
> return RangeQuery.withLowerBound(lower);
> } else if (isBlank(lower) && !isBlank(upper)) {
> return RangeQuery.withUpperBound(upper);
> } else {
> return RangeQuery.withRange(lower, upper);
> }
> } {code}
>  
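
With null-tolerant bounds, the above collapses to a single call; a minimal sketch 
(the type parameters are placeholders):

{code:java}
// Sketch: after this change, a null lower and/or upper bound simply means
// "unbounded" on that side.
private RangeQuery<String, String> createRangeQuery(String lower, String upper) {
    return RangeQuery.withRange(lower, upper);
}
{code}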



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15316) CommitRequestManager not calling RequestState callbacks

2023-08-08 Thread Lianet Magrans (Jira)
Lianet Magrans created KAFKA-15316:
--

 Summary: CommitRequestManager not calling RequestState callbacks 
 Key: KAFKA-15316
 URL: https://issues.apache.org/jira/browse/KAFKA-15316
 Project: Kafka
  Issue Type: Bug
  Components: clients, consumer
Reporter: Lianet Magrans


CommitRequestManager is not triggering the RequestState callbacks that update 
{_}lastReceivedMs{_}, affecting the _canSendRequest_ verification of the 
RequestState



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-963: Upload and delete lag metrics in Tiered Storage

2023-08-08 Thread Kamal Chandraprakash
Hi Christo,

Thanks for the KIP!

The proposed tiered storage metrics are useful. The unit mentioned in the
KIP is the number of records.
Each topic can have a varying number of records in a segment depending on
the record size.

Do you think having the tier lag by number of segments, or size of
segments in bytes, would be useful
to the operator?

Thanks,
Kamal

On Tue, Aug 8, 2023 at 8:56 PM Christo Lolov  wrote:

> Hello all!
>
> I would like to start a discussion for KIP-963: Upload and delete lag
> metrics in Tiered Storage (https://cwiki.apache.org/confluence/x/sZGzDw).
>
> The purpose of this KIP is to introduce a couple of metrics to track lag
> with respect to remote storage from the point of view of Kafka.
>
> Thanks in advance for leaving a review!
>
> Best,
> Christo
>


[DISCUSS] KIP-963: Upload and delete lag metrics in Tiered Storage

2023-08-08 Thread Christo Lolov
Hello all!

I would like to start a discussion for KIP-963: Upload and delete lag
metrics in Tiered Storage (https://cwiki.apache.org/confluence/x/sZGzDw).

The purpose of this KIP is to introduce a couple of metrics to track lag
with respect to remote storage from the point of view of Kafka.

Thanks in advance for leaving a review!

Best,
Christo


Re: [DISCUSSION] KIP-965: Support disaster recovery between clusters by MirrorMaker

2023-08-08 Thread Ryanne Dolan
hudeqi, I'd call the configuration property something that describes what
it does rather than its intended use case.

Ryanne

On Tue, Aug 8, 2023, 4:46 AM hudeqi <16120...@bjtu.edu.cn> wrote:

> Hi, all. I want to submit a kip, and hope get some review and good
> suggestions. the kip is here: https://cwiki.apache.org/confluence/x/k5KzDw
>
> Motivation:
>
>
> When mirroring ACLs, MirrorMaker downgrades allow ALL ACLs to allow READ.
> The rationale is to prevent other clients from producing to remote topics,
> as mentioned in KIP-382: MirrorMaker 2.0.
>
> However, in disaster recovery scenarios, where the target cluster is not
> used and is just a "hot standby", it would be preferable to have exactly the
> same ACLs on both clusters to speed up failover. Therefore, in this
> scenario, we need to synchronize the topic write ACLs, group ACLs, and
> the relevant user SCRAM credentials of the source cluster topics to the target
> cluster, so that when the user switches read and write traffic directly
> to the target cluster, everything can run immediately.
>
> Proposed changes:
>
> Add a config parameter, disaster.recovery.enabled, to MirrorMakerConfig.
> The default is false, which leaves the current sync behavior unchanged; if
> set to true, MirrorMaker will synchronize the topic write ACLs, group ACLs, and
> the relevant user SCRAM credentials of the source cluster's replicated topics to
> the target cluster.
>
> topic write ACL: Filter all topic read ACL information related
> to the topics replicated from the source cluster.
> user SCRAM credentials: Filter the user SCRAM credentials to be synchronized
> according to the topic ACL information to be synchronized, and create the users
> in the target cluster.
> group ACL: The group ACL information is obtained by filtering on the users
> obtained above.
>
> Looking forward to your reply.
>
> Best, hudeqi


[jira] [Created] (KAFKA-15315) Use getOrDefault rather than get

2023-08-08 Thread roon (Jira)
roon created KAFKA-15315:


 Summary: Use getOrDefault rather than get
 Key: KAFKA-15315
 URL: https://issues.apache.org/jira/browse/KAFKA-15315
 Project: Kafka
  Issue Type: Improvement
  Components: clients
Reporter: roon






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[QUESTION] about topic dimension quota

2023-08-08 Thread hudeqi
Hi, all. Let me ask a question first: do we plan to support quotas in 
the topic dimension?

[jira] [Created] (KAFKA-15314) No Quota applied if client-id is null or empty

2023-08-08 Thread Jorge Esteban Quilcate Otoya (Jira)
Jorge Esteban Quilcate Otoya created KAFKA-15314:


 Summary: No Quota applied if client-id is null or empty
 Key: KAFKA-15314
 URL: https://issues.apache.org/jira/browse/KAFKA-15314
 Project: Kafka
  Issue Type: Bug
  Components: core
Reporter: Jorge Esteban Quilcate Otoya


When quotas were proposed, KIP-13[1] stated:

>  In addition, there will be a quota reserved for clients not presenting a 
>client id (for e.g. simple consumers not setting the id). This will default to 
>an empty client id ("") and all such clients will share the quota for that 
>empty id (which should be the default quota).

However, it seems that when the client-id is null or empty and a default quota 
for client-id is present, no quota is applied.

Even though the Java clients set a default value [2][3], the protocol accepts a 
null client-id [4], so other client implementations could send a null value to 
bypass the quota.

The related code [5][6] shows the quota metric tags being prepared with the 
client-id (potentially null), and the quota being set to null when both the 
client-id and the (sanitized) user are null.

Adding some tests to showcase this: 
https://github.com/apache/kafka/pull/14165

 

Is it expected for client-id=null to bypass quotas? If it is, then a KIP or 
documentation should clarify this; otherwise we should fix this behavior as a 
bug, e.g. we could "sanitize" the client-id similarly to the user name, 
defaulting it to the empty string when the input is null or empty.
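
To make that idea concrete, here is a rough sketch of the sanitization behavior (the actual quota handling lives in ClientQuotaManager.scala; the helper below and its name are hypothetical and only illustrate the intent): a null or empty client-id would be normalized to the empty string so that such clients share the default client-id quota instead of bypassing it.

    public class ClientIdSanitizerExample {

        // Hypothetical helper: mirrors how the user principal is defaulted, but for client-id.
        static String sanitizeClientId(String clientId) {
            return (clientId == null || clientId.isEmpty()) ? "" : clientId;
        }

        public static void main(String[] args) {
            System.out.println("[" + sanitizeClientId(null) + "]");     // [] -> shares the default quota
            System.out.println("[" + sanitizeClientId("") + "]");       // []
            System.out.println("[" + sanitizeClientId("my-app") + "]"); // [my-app]
        }
    }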


 

As a side note, I suspect similar behavior could happen with the user principal. 
Even though the value defaults to ANONYMOUS, if a client implementation sends an 
empty value it may also bypass the default quota, though I need to test this 
further once this is confirmed to be a bug.

 

[1]: https://cwiki.apache.org/confluence/display/KAFKA/KIP-13+-+Quotas

[2]: https://github.com/apache/kafka/blob/e98508747acc8972ac5ceb921e0fd3a7d7bd5e9c/clients/src/main/java/org/apache/kafka/clients/producer/ProducerConfig.java#L498-L508

[3]: https://github.com/apache/kafka/blob/ab71c56973518bac8e1868eccdc40b17d7da35c1/clients/src/main/java/org/apache/kafka/clients/consumer/ConsumerConfig.java#L616-L628

[4]: https://github.com/apache/kafka/blob/9f26906fcc2fd095b7d27c504e342b9a8d619b4b/clients/src/main/resources/common/message/RequestHeader.json#L34-L40

[5]: https://github.com/apache/kafka/blob/322ac86ba282f35373382854cc9e790e4b7fb5fc/core/src/main/scala/kafka/server/ClientQuotaManager.scala#L588-L628

[6]: https://github.com/apache/kafka/blob/322ac86ba282f35373382854cc9e790e4b7fb5fc/core/src/main/scala/kafka/server/ClientQuotaManager.scala#L651-L652



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] KIP-959 Add BooleanConverter to Kafka Connect

2023-08-08 Thread Mickael Maison
Hi,

+1 (binding)

Thanks for the KIP!

On Mon, Aug 7, 2023 at 3:15 PM Hector Geraldino (BLOOMBERG/ 919 3RD A)
 wrote:
>
> Hello,
>
> I still need help from a committer to review/approve this (small) KIP, which 
> adds a new BooleanConverter to the list of converters in Kafka Connect.
>
> The KIP has a companion PR implementing the feature as well.
>
> Thanks again!
> Sent from Bloomberg Professional for iPhone
>
> - Original Message -
> From: Hector Geraldino 
> To: dev@kafka.apache.org
> At: 08/01/23 11:48:23 UTC-04:00
>
>
> Hi,
>
> Still missing one binding vote for this (very small) KIP to pass :)
>
> From: dev@kafka.apache.org At: 07/28/23 09:37:45 UTC-4:00 To:
> dev@kafka.apache.org
> Subject: Re: [VOTE] KIP-959 Add BooleanConverter to Kafka Connect
>
> Hi everyone,
>
> Thanks everyone who has reviewed and voted for this KIP.
>
> So far it has received 3 non-binding votes (Andrew Schofield, Yash Mayya,
> Kamal Chandraprakash) and 2 binding votes (Chris Egerton, Greg Harris) -
> still shy of one binding vote to pass.
>
> Can we get help from a committer to push it through?
>
> Thank you!
> Hector
>
> Sent from Bloomberg Professional for iPhone
>
> - Original Message -
> From: Greg Harris 
> To: dev@kafka.apache.org
> At: 07/26/23 12:23:20 UTC-04:00
>
>
> Hey Hector,
>
> Thanks for the straightforward and clear KIP!
> +1 (binding)
>
> Thanks,
> Greg
>
> On Wed, Jul 26, 2023 at 5:16 AM Chris Egerton  wrote:
> >
> > +1 (binding)
> >
> > Thanks Hector!
> >
> > On Wed, Jul 26, 2023 at 3:18 AM Kamal Chandraprakash <
> > kamal.chandraprak...@gmail.com> wrote:
> >
> > > +1 (non-binding). Thanks for the KIP!
> > >
> > > On Tue, Jul 25, 2023 at 11:12 PM Yash Mayya  wrote:
> > >
> > > > Hi Hector,
> > > >
> > > > Thanks for the KIP!
> > > >
> > > > +1 (non-binding)
> > > >
> > > > Thanks,
> > > > Yash
> > > >
> > > > On Tue, Jul 25, 2023 at 11:01 PM Andrew Schofield <
> > > > andrew_schofield_j...@outlook.com> wrote:
> > > >
> > > > > Thanks for the KIP. As you say, not that controversial.
> > > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > Thanks,
> > > > > Andrew
> > > > >
> > > > > > On 25 Jul 2023, at 18:22, Hector Geraldino (BLOOMBERG/ 919 3RD A) <
> > > > > hgerald...@bloomberg.net> wrote:
> > > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > The changes proposed by KIP-959 (Add BooleanConverter to Kafka Connect)
> > > > > > have a limited scope and shouldn't be controversial. I'm opening a voting
> > > > > > thread with the hope that it can be included in the upcoming 3.6 release.
> > > > > >
> > > > > > Here are some links:
> > > > > >
> > > > > > KIP: https://cwiki.apache.org/confluence/display/KAFKA/KIP-959%3A+Add+BooleanConverter+to+Kafka+Connect
> > > > > > JIRA: https://issues.apache.org/jira/browse/KAFKA-15248
> > > > > > Discussion thread:
> > > > > https://lists.apache.org/thread/15c2t0kl9bozmzjxmkl5n57kv4l4o1dt
> > > > > > Pull Request: https://github.com/apache/kafka/pull/14093
> > > > > >
> > > > > > Thanks!
> > > > >
> > > > >
> > > > >
> > > >
> > >


[jira] [Created] (KAFKA-15313) Delete remote log segments partition asynchronously when a partition is deleted.

2023-08-08 Thread Satish Duggana (Jira)
Satish Duggana created KAFKA-15313:
--

 Summary: Delete remote log segments partition asynchronously when 
a partition is deleted. 
 Key: KAFKA-15313
 URL: https://issues.apache.org/jira/browse/KAFKA-15313
 Project: Kafka
  Issue Type: Task
  Components: core
Reporter: Satish Duggana
Assignee: Abhijeet Kumar


KIP-405 already covers the approach to delete remote log segments 
asynchronously through controller and RLMM layers. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] KIP-714: Client metrics and observability

2023-08-08 Thread Milind Luthra
Hi Andrew, thanks for working on the KIP.

+1 (non binding)

Thanks,
Milind

On Fri, Aug 4, 2023 at 2:16 PM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

> Hi,
> After almost 2 1/2 years in the making, I would like to call a vote for
> KIP-714 (
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> ).
>
> This KIP aims to improve monitoring and troubleshooting of client
> performance by enabling clients to push metrics to brokers.
>
> I’d like to thank everyone that participated in the discussion, especially
> the librdkafka team since one of the aims of the KIP is to enable any
> client to participate, not just the Apache Kafka project’s Java clients.
>
> Thanks,
> Andrew


[DISCUSSION] KIP-965: Support disaster recovery between clusters by MirrorMaker

2023-08-08 Thread hudeqi
Hi all. I would like to submit a KIP and hope to get some reviews and good 
suggestions. The KIP is here: https://cwiki.apache.org/confluence/x/k5KzDw

Motivation:


When mirroring ACLs, MirrorMaker downgrades allow ALL ACLs to allow READ. The 
rationale is to prevent other clients from producing to remote topics, as 
mentioned in KIP-382: MirrorMaker 2.0.

However, in disaster recovery scenarios where the target cluster is not actively 
used and is just a "hot standby", it would be preferable to have exactly the 
same ACLs on both clusters to speed up failover. Therefore, in this scenario, we 
need to synchronize the topic write ACLs, group ACLs, and the corresponding user 
SCRAM credentials of the source cluster topics to the target cluster, so that 
when the user switches read and write traffic directly to the target cluster, it 
can run directly.

Proposed changes:

Add a config parameter, disaster.recovery.enabled, in MirrorMakerConfig. The 
default is false, which leaves the current sync behavior unchanged; if set to 
true, it will synchronize the topic write ACLs, group ACLs, and the 
corresponding user SCRAM credentials of the source cluster's replicated topics 
to the target cluster (see the example sketch after the list below).

topic write ACL: Filter all topic read ACL information related to the topics 
replicated from the source cluster.
user SCRAM credential: Filter the user SCRAM credentials to be synchronized 
according to the topic ACL information to be synchronized, and create the users 
in the target cluster.
group ACL: The group ACL information is obtained by filtering on the users 
obtained above.
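
As an illustration only, this is how the proposed setting might appear in a MirrorMaker 2 properties file, assuming the KIP is adopted as described. Note that disaster.recovery.enabled does not exist in current releases, and its exact name and scope (global vs. per replication flow) are still up for discussion in this thread; the other properties shown are standard MM2 settings.

    clusters = source, target
    source->target.enabled = true
    source->target.topics = .*

    # Existing behavior: topic ACLs are synced, but write ACLs are downgraded to READ on the target.
    sync.topic.acls.enabled = true

    # Proposed in this KIP (default false): also sync topic write ACLs, group ACLs and
    # the corresponding user SCRAM credentials so the target can serve as a hot standby.
    disaster.recovery.enabled = true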

Looking forward to your reply.

Best, hudeqi

Re: [DISCUSS] KIP-964: Have visibility when produce requests become "async".

2023-08-08 Thread Sergio Daniel Troiano
Hi everyone!

I have a question regarding where the metric should go: I need to know where
the local append happens in the active segment log.

I saw some files which "append" to the segment log, but I am a bit confused
about which one is the right one:
core/src/main/scala/kafka/log/LocalLog.scala
core/src/main/scala/kafka/log/LogSegment.scala
clients/src/main/java/org/apache/kafka/common/record/MemoryRecords.java

In the end I need to identify exactly where the data is appended to the
active segment; this should be only the local append, as we are interested
in measuring the local latency of a write in the OS (sync vs async).

Thanks in advance!!!


On Mon, 7 Aug 2023 at 14:43, Sergio Daniel Troiano <
sergio.troi...@adevinta.com> wrote:

> Hi everyone!
>
> I would like to start a discussion thread for this KIP
> 
>
>
> Thanks
>


Re: [DISCUSS] KIP-714: Client metrics and observability

2023-08-08 Thread Doğuşcan Namal
Thanks for your answers, Andrew. I share your pain that it took a while to get
this KIP approved and that you want to reduce its scope; I will be happy to
help you with the implementation :)

Could you help me walk through what happens if the target broker is
unreachable? Is the client going to drop these metrics, or is it going to
send them to the other brokers it is connected to? This information is
crucial to understanding the client-side impact of leadership failovers.
Moreover, in the case of partial outages, such as when only the network
between the client and the broker is partitioned while the network within the
cluster is healthy, there is practically no way other than client-side metrics
to identify the problem.

Doguscan

On Fri, 4 Aug 2023 at 15:33, Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

> Hi Doguscan,
> Thanks for your comments. I’m glad to hear you’re interested in this KIP.
>
> 1) It is preferred that a client sends its metrics over the same broker
> connection, but it is actually able to send them to any broker. As a result,
> if a
> broker becomes
> unhealthy, the client can push its metrics to any other broker. It seems
> to me that
> pushing to KRaft controllers instead just has the effect of increasing the
> load on
> the controllers, while still having the characteristic that an unhealthy
> controller
> would present inconvenience for collecting metrics.
>
> 2) When the `PushTelemetryRequest.Terminating` flag is set, the standard
> request
> throttling is not disabled. The metrics rate-limiting based on the push
> interval is
> not applied in this case for a single request for the combination of
> client instance ID
> and subscription ID.
>
> (I have corrected the KIP text because it erroneously said “client ID and
> subscription ID”.)
>
> 3) While this is a theoretical problem, I’m not keen on adding yet more
> configurations
> to the broker or client. The `interval.ms` configuration on the
> CLIENT_METRICS
> resource could perhaps have a minimum and maximum value to prevent
> accidental
> misconfiguration.
>
> 4) One of the reasons that this KIP has taken so long to get to this stage
> is that
> it tried to do many things all at once. So, it’s greatly simplified
> compared with
> 6 months ago. I can see the value of collecting client configurations for
> problem
> determination, but I don’t want to make this KIP more complicated. I think
> the
> idea has merit as a separate follow-on KIP. I would be happy to collaborate
> with you on this.
>
> 5) The default is set to 5 minutes to minimise the load on the broker for
> situations
> in which the administrator didn’t set an interval on a metrics
> subscription. To
> use an interval of 1 minute, it is only necessary to set `interval.ms` on
> the metrics
> subscription to 60000ms.
>
> 6) Uncompressed data is always supported. The KIP says:
>  "The CompressionType of NONE will not be
> "present in the response from the broker, though the broker does support
> uncompressed
> "client telemetry if none of the accepted compression codecs are supported
> by the client.”
> So in your example, the client need only use CompressionType=NONE.
>
> Thanks,
> Andrew
>
> > On 4 Aug 2023, at 14:04, Doğuşcan Namal 
> wrote:
> >
> > Hi Andrew, thanks a lot for this KIP. I was thinking of something similar
> > so thanks for writing this down 
> >
> >
> >
> > Couple of questions related to the design:
> >
> >
> >
> > 1. Can we investigate the option of using the KRaft controllers instead of
> > the brokers for sending metrics? The disadvantage of sending these metrics
> > directly to the brokers is that it tightly couples metric observability to
> > data plane availability. If the broker is unhealthy then the root cause of
> > an incident is clear; however, partial failures make it hard to debug these
> > incidents from the broker's perspective.
> >
> >
> >
> > 2. Rate limiting will be disabled if the `PushTelemetryRequest.Terminating`
> > flag is set. However, this may cause unavailability on the broker if too
> > many clients are terminated at once; in particular, network threads could
> > become busy and introduce latency for produce/consume on other
> > non-terminating clients' connections. I think there is room for
> > improvement here. If the client is gracefully shutting down, it could wait
> > for the request to be handled if it is being rate limited; it doesn't need
> > to "force push" the metrics. For that reason, maybe we could define a
> > separate rate limit for telemetry data?
> >
> >
> >
> > 3. `PushIntervalMs` is set on the client side by a response from
> > `GetTelemetrySubscriptionsResponse`. If the broker sets this value too
> > low, like 1ms, this may hog all of the client's activity and cause an
> > impact on the client side. I think we should introduce configurations on
> > both the client and the broker side for the minimum and maximum values, to
> > fence out misconfigurations.
> >
> >
> >
> 

Re: [DISCUSS] KIP-943: Add independent "offset.storage.segment.bytes" for connect-distributed.properties

2023-08-08 Thread hudeqi
Sorry for getting back so late, Yash Mayya, Greg Harris, Sagar; I did not 
receive email reminders and missed your replies.

Thank you for your thoughts and suggestions, I learned a lot. I will give my 
thoughts and answers together:
1. The default of 50MB is the configuration I actually used in production to 
solve this problem, and it worked well (see the description in the JIRA: 
https://issues.apache.org/jira/projects/KAFKA/issues/KAFKA-15086?filter=allopenissues. 
In fact, I think it may be better to set this value even smaller, which is why 
I moved away from a default like the one used for __consumer_offsets, but I 
don't know what the best default value is). Secondly, I also set the 50MB 
default in production through ConfigDef#defineInternal, and if the value 
configured by the user is greater than the default, a warning is logged. The 
only difference from what you suggested is that I would overwrite the 
user-configured value with the default (this point was rejected by Chris 
Egerton: https://github.com/apache/kafka/pull/13852; in fact, you all agree 
that the user-configured value should not be overridden directly, and I now 
agree with this). A rough sketch of this approach is included after point 2 
below.
2. The potential bug Greg mentioned, which may lead to inconsistent state 
between workers, is a great point. It is true that we cannot directly change 
the configuration of existing internal topics. Perhaps a hacky and rather ugly 
approach is to wait until the active segments of all current partitions are 
relatively small, first stop all Connect instances, then change the topic 
configuration, and finally restart the instances.
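
Here is a rough sketch of the reduced approach from point 1, assuming the KIP is narrowed as suggested: register an internal default for the offsets topic segment size via ConfigDef#defineInternal and log a warning, without overriding, when the user configures a larger value. The surrounding class, constant names, and the exact warning text are hypothetical; only the config name offset.storage.segment.bytes and the 50MB default come from this thread.

    import java.util.Map;

    import org.apache.kafka.common.config.AbstractConfig;
    import org.apache.kafka.common.config.ConfigDef;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class OffsetStorageSegmentBytesExample extends AbstractConfig {

        private static final Logger log = LoggerFactory.getLogger(OffsetStorageSegmentBytesExample.class);

        public static final String OFFSET_STORAGE_SEGMENT_BYTES_CONFIG = "offset.storage.segment.bytes";
        public static final long DEFAULT_SEGMENT_BYTES = 50L * 1024 * 1024; // 50MB

        private static final ConfigDef CONFIG = new ConfigDef()
            // Internal config: it gets a default but is not advertised in the generated docs.
            .defineInternal(OFFSET_STORAGE_SEGMENT_BYTES_CONFIG,
                            ConfigDef.Type.LONG,
                            DEFAULT_SEGMENT_BYTES,
                            ConfigDef.Importance.LOW);

        public OffsetStorageSegmentBytesExample(Map<String, String> props) {
            super(CONFIG, props);
            long configured = getLong(OFFSET_STORAGE_SEGMENT_BYTES_CONFIG);
            // Warn, but keep the user's value instead of overriding it.
            if (configured > DEFAULT_SEGMENT_BYTES) {
                log.warn("{} is set to {} bytes, larger than the recommended default of {} bytes; "
                        + "large segments can slow down reads of the internal topic on worker startup.",
                        OFFSET_STORAGE_SEGMENT_BYTES_CONFIG, configured, DEFAULT_SEGMENT_BYTES);
            }
        }

        public static void main(String[] args) {
            Map<String, String> props =
                Map.of(OFFSET_STORAGE_SEGMENT_BYTES_CONFIG, String.valueOf(1024L * 1024 * 1024));
            OffsetStorageSegmentBytesExample config = new OffsetStorageSegmentBytesExample(props);
            System.out.println(config.getLong(OFFSET_STORAGE_SEGMENT_BYTES_CONFIG)); // 1073741824, with a warning logged
        }
    }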

To sum up, I think the scope of the KIP could be reduced to: only set the 
default value of 'segment.bytes' for the internal topics and log a warning when 
the user configures a larger value. What do you think? If there's a better way 
I'm all ears.

best,
hudeqi