[jira] [Resolved] (KAFKA-13943) Fix flaky test QuorumControllerTest.testMissingInMemorySnapshot()
[ https://issues.apache.org/jira/browse/KAFKA-13943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Gustafson resolved KAFKA-13943.
-------------------------------------
    Resolution: Fixed

> Fix flaky test QuorumControllerTest.testMissingInMemorySnapshot()
> -----------------------------------------------------------------
>
>             Key: KAFKA-13943
>             URL: https://issues.apache.org/jira/browse/KAFKA-13943
>         Project: Kafka
>      Issue Type: Bug
>      Components: unit tests
>        Reporter: Divij Vaidya
>        Assignee: Divij Vaidya
>        Priority: Major
>          Labels: flaky-test
>         Fix For: 3.3.0, 3.3
>
> Test failed at
> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-12197/3/tests
>
> {noformat}
> [2022-05-27 09:34:42,382] INFO [Controller 0] Creating new QuorumController with clusterId wj9LhgPJTV-KYEItgqvtQA, authorizer Optional.empty. (org.apache.kafka.controller.QuorumController:1484)
> [2022-05-27 09:34:42,393] DEBUG [LocalLogManager 0] Node 0: running log check. (org.apache.kafka.metalog.LocalLogManager:479)
> [2022-05-27 09:34:42,394] DEBUG [LocalLogManager 0] initialized local log manager for node 0 (org.apache.kafka.metalog.LocalLogManager:622)
> [2022-05-27 09:34:42,396] INFO [LocalLogManager 0] Node 0: registered MetaLogListener 1774961169 (org.apache.kafka.metalog.LocalLogManager:640)
> [2022-05-27 09:34:42,397] DEBUG [LocalLogManager 0] Node 0: running log check. (org.apache.kafka.metalog.LocalLogManager:479)
> [2022-05-27 09:34:42,397] DEBUG [LocalLogManager 0] Node 0: Executing handleLeaderChange LeaderAndEpoch(leaderId=OptionalInt[0], epoch=1) (org.apache.kafka.metalog.LocalLogManager:520)
> [2022-05-27 09:34:42,398] DEBUG [Controller 0] Executing handleLeaderChange[1]. (org.apache.kafka.controller.QuorumController:438)
> [2022-05-27 09:34:42,398] INFO [Controller 0] Becoming the active controller at epoch 1, committed offset -1, committed epoch -1, and metadata.version 5 (org.apache.kafka.controller.QuorumController:950)
> [2022-05-27 09:34:42,398] DEBUG [Controller 0] Creating snapshot -1 (org.apache.kafka.timeline.SnapshotRegistry:197)
> [2022-05-27 09:34:42,399] DEBUG [Controller 0] Processed handleLeaderChange[1] in 951 us (org.apache.kafka.controller.QuorumController:385)
> [2022-05-27 09:34:42,399] INFO [Controller 0] Initializing metadata.version to 5 (org.apache.kafka.controller.QuorumController:926)
> [2022-05-27 09:34:42,399] INFO [Controller 0] Setting metadata.version to 5 (org.apache.kafka.controller.FeatureControlManager:273)
> [2022-05-27 09:34:42,400] DEBUG [Controller 0] Creating snapshot 9223372036854775807 (org.apache.kafka.timeline.SnapshotRegistry:197)
> [2022-05-27 09:34:42,400] DEBUG [Controller 0] Read-write operation bootstrapMetadata(1863535402) will be completed when the log reaches offset 9223372036854775807. (org.apache.kafka.controller.QuorumController:725)
> [2022-05-27 09:34:42,402] DEBUG append(batch=LocalRecordBatch(leaderEpoch=1, appendTimestamp=10, records=[ApiMessageAndVersion(RegisterBrokerRecord(brokerId=0, incarnationId=kxAT73dKQsitIedpiPtwBw, brokerEpoch=-9223372036854775808, endPoints=[BrokerEndpoint(name='PLAINTEXT', host='localhost', port=9092, securityProtocol=0)], features=[], rack=null, fenced=true) at version 0)]), prevOffset=1) (org.apache.kafka.metalog.LocalLogManager$SharedLogData:247)
> [2022-05-27 09:34:42,402] INFO [Controller 0] Registered new broker: RegisterBrokerRecord(brokerId=0, incarnationId=kxAT73dKQsitIedpiPtwBw, brokerEpoch=-9223372036854775808, endPoints=[BrokerEndpoint(name='PLAINTEXT', host='localhost', port=9092, securityProtocol=0)], features=[], rack=null, fenced=true) (org.apache.kafka.controller.ClusterControlManager:368)
> [2022-05-27 09:34:42,403] WARN [Controller 0] registerBroker: failed with unknown server exception RuntimeException at epoch 1 in 2449 us. Reverting to last committed offset -1. (org.apache.kafka.controller.QuorumController:410)
> java.lang.RuntimeException: Can't create a new snapshot at epoch 1 because there is already a snapshot with epoch 9223372036854775807
>     at org.apache.kafka.timeline.SnapshotRegistry.getOrCreateSnapshot(SnapshotRegistry.java:190)
>     at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.run(QuorumController.java:723)
>     at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
>     at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
>     at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
>     at java.base/java.lang.Thread.run(Thread.java:833)
> {noformat}
> {noformat}
> Full stack trace
> java.util.concurrent.ExecutionException:
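The RuntimeException in the log comes from SnapshotRegistry.getOrCreateSnapshot refusing to create a snapshot at an epoch older than one that already exists (the bootstrap path had created a placeholder snapshot at Long.MAX_VALUE before the registerBroker write tried to snapshot at epoch 1). A toy sketch of that invariant, purely illustrative and not Kafka's actual SnapshotRegistry class:

```java
import java.util.TreeMap;

// Toy model of a snapshot registry that only allows snapshots at
// monotonically non-decreasing epochs, mirroring the invariant whose
// violation produced the RuntimeException in the log above.
class ToySnapshotRegistry {
    private final TreeMap<Long, String> snapshots = new TreeMap<>();

    String getOrCreateSnapshot(long epoch) {
        // Reject creating a snapshot older than the newest existing one.
        if (!snapshots.isEmpty() && snapshots.lastKey() > epoch) {
            throw new RuntimeException("Can't create a new snapshot at epoch " + epoch
                + " because there is already a snapshot with epoch " + snapshots.lastKey());
        }
        // Creating a snapshot at an existing epoch returns the existing one.
        return snapshots.computeIfAbsent(epoch, e -> "snapshot-" + e);
    }
}
```

In these terms the flaky failure is the sequence `getOrCreateSnapshot(Long.MAX_VALUE)` followed by `getOrCreateSnapshot(1)`, which throws.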
Re: [DISCUSS] KIP-847: Add ProducerCount metrics
I've updated the KIP to clarify that the metric reflects the total amount of producer ids in all partitions maintained in the broker.

-Artem

On Thu, Jun 30, 2022 at 11:46 AM Jun Rao wrote:
> Hi, Artem,
>
> Thanks for the reply.
>
> The memory usage on the broker is proportional to the number of (partition, pid) combinations. So, I am wondering if we could have a metric that captures that. The proposed pid count metric doesn't fully capture that since each pid could be associated with a different number of partitions.
>
> Thanks,
>
> Jun
>
> On Thu, Jun 30, 2022 at 9:24 AM Justine Olshan wrote:
>> Hi Artem,
>> Thanks for the update to include motivation. Makes sense to me.
>> Justine
>>
>> On Wed, Jun 29, 2022 at 6:51 PM Luke Chen wrote:
>>> Hi Artem,
>>>
>>> Thanks for the update.
>>> LGTM.
>>>
>>> Luke
>>>
>>> On Thu, Jun 30, 2022 at 6:51 AM Artem Livshits wrote:
>>>> Thank you for your feedback. I've updated the KIP to elaborate on the motivation and provide some background on producer ids and how we measure them.
>>>>
>>>> Also, after some thinking and discussing it offline with some folks, I think that we don't really need partitioner level metrics. We can use existing tools to do granular debugging. I've moved partition level metrics to the rejected alternatives section.
>>>>
>>>> -Artem
>>>>
>>>> On Wed, Jun 29, 2022 at 1:57 AM Luke Chen wrote:
>>>>> Hi Artem,
>>>>>
>>>>> Could you elaborate more in the motivation section?
>>>>> I'm interested to know what kind of scenarios this metric can benefit for. What could it bring to us when a topic partition has 100 ProducerIdCount VS another topic partition has 10 ProducerIdCount?
>>>>>
>>>>> Thank you.
>>>>> Luke
>>>>>
>>>>> On Wed, Jun 29, 2022 at 6:30 AM Jun Rao wrote:
>>>>>> Hi, Artem,
>>>>>>
>>>>>> Thanks for the KIP.
>>>>>>
>>>>>> Could you explain the partition level ProducerIdCount a bit more? Does that reflect the number of PIDs ever produced to a partition since the broker is started? Do we reduce the count after a PID expires?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jun
>>>>>>
>>>>>> On Wed, Jun 22, 2022 at 1:08 AM David Jacot wrote:
>>>>>>> Hi Artem,
>>>>>>>
>>>>>>> The KIP LGTM.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>> On Tue, Jun 21, 2022 at 9:32 PM Artem Livshits wrote:
>>>>>>>> If there is no other feedback I'm going to start voting in a couple days.
>>>>>>>>
>>>>>>>> -Artem
>>>>>>>>
>>>>>>>> On Fri, Jun 17, 2022 at 3:50 PM Artem Livshits <alivsh...@confluent.io> wrote:
>>>>>>>>> Thank you for your feedback. Updated the KIP and added the Rejected Alternatives section.
>>>>>>>>>
>>>>>>>>> -Artem
>>>>>>>>>
>>>>>>>>> On Fri, Jun 17, 2022 at 1:16 PM Ismael Juma <ism...@juma.me.uk> wrote:
>>>>>>>>>> If we don't track them separately, then it makes sense to keep it as one metric. I'd probably name it ProducerIdCount in that case.
>>>>>>>>>>
>>>>>>>>>> Ismael
>>>>>>>>>>
>>>>>>>>>> On Fri, Jun 17, 2022 at 12:04 PM Artem Livshits wrote:
>>>>>>>>>>> Do you propose to have 2 metrics instead of one? Right now we don't track if the producer id was transactional or idempotent and for metric collection we'd either have to pay the cost of iterating over producer ids (which could be a lot) or split the producer map into 2 or cache the counts, which complicates the code.
>>>>>>>>>>>
>>>>>>>>>>> From the monitoring perspective, I think one metric should be good, but maybe I'm missing some scenarios.
>>>>>>>>>>>
>>>>>>>>>>> -Artem
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jun 17, 2022 at 12:28 AM Ismael Juma <ism...@juma.me.uk> wrote:
>>>>>>>>>>>> I like the suggestion to have IdempotentProducerCount and TransactionalProducerCount metrics.
>>>>>>>>>>>>
>>>>>>>>>>>> Ismael
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jun 16,
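The trade-off Artem describes (iterating a possibly large producer-id map per metric read vs. caching counts and complicating the bookkeeping) can be sketched as follows. This is a hedged illustration with made-up names, not Kafka's actual ProducerStateManager code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the "cache the counts" alternative from the thread above:
// keep per-kind counters updated on add/remove so metric reads are O(1)
// instead of scanning the whole pid map. Class and method names are
// illustrative, not Kafka's API.
class ProducerIdTracker {
    // pid -> whether the producer id is transactional
    private final Map<Long, Boolean> pidToTransactional = new HashMap<>();
    private int transactionalCount = 0;

    void addProducerId(long pid, boolean transactional) {
        Boolean prev = pidToTransactional.put(pid, transactional);
        if (prev == null && transactional) transactionalCount++;
        else if (prev != null && prev != transactional) transactionalCount += transactional ? 1 : -1;
    }

    void removeProducerId(long pid) {
        Boolean prev = pidToTransactional.remove(pid);
        if (prev != null && prev) transactionalCount--;
    }

    // O(1) reads; a single ProducerIdCount metric needs only the map size,
    // which is why one metric avoids the extra bookkeeping entirely.
    int producerIdCount() { return pidToTransactional.size(); }
    int transactionalProducerCount() { return transactionalCount; }
    int idempotentProducerCount() { return pidToTransactional.size() - transactionalCount; }
}
```

Note the extra counter maintenance on every add/remove is exactly the complication that makes a single ProducerIdCount metric (just the map size) attractive.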
Kafka server problem
Hello! I encountered a Kafka problem. The Kafka server log reports the error: "Fetch request with correlation id 1171438 from client ReplicaFetcherThread-0-3 on partition [sp201804-18-part,16] failed due to Leader not local for partition". Restarting is useless; it still reports the same error. How can I solve this problem? My Kafka version is 10.0.8.2.1.

chengweihong...@chinaccs.cn
[jira] [Resolved] (KAFKA-13957) Flaky Test StoreQueryIntegrationTest.shouldQuerySpecificActivePartitionStores
[ https://issues.apache.org/jira/browse/KAFKA-13957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruno Cadonna resolved KAFKA-13957.
-----------------------------------
    Resolution: Fixed

> Flaky Test StoreQueryIntegrationTest.shouldQuerySpecificActivePartitionStores
> -----------------------------------------------------------------------------
>
>             Key: KAFKA-13957
>             URL: https://issues.apache.org/jira/browse/KAFKA-13957
>         Project: Kafka
>      Issue Type: Bug
>      Components: streams
>        Reporter: A. Sophie Blee-Goldman
>        Assignee: Matthew de Detrich
>        Priority: Major
>          Labels: flaky-test
>     Attachments: StoreQueryIntegrationTest.shouldQuerySpecificActivePartitionStores.rtf
>
> Failed on a local build so I have the full logs (attached)
> {code:java}
> java.lang.AssertionError: Unexpected exception thrown while getting the value from store.
> Expected: is (a string containing "Cannot get state store source-table because the stream thread is PARTITIONS_ASSIGNED, not RUNNING" or a string containing "The state store, source-table, may have migrated to another instance" or a string containing "Cannot get state store source-table because the stream thread is STARTING, not RUNNING")
>      but: was "The specified partition 1 for store source-table does not exist."
>     at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
>     at org.apache.kafka.streams.integration.StoreQueryIntegrationTest.verifyRetrievableException(StoreQueryIntegrationTest.java:539)
>     at org.apache.kafka.streams.integration.StoreQueryIntegrationTest.lambda$shouldQuerySpecificActivePartitionStores$5(StoreQueryIntegrationTest.java:241)
>     at org.apache.kafka.streams.integration.StoreQueryIntegrationTest.until(StoreQueryIntegrationTest.java:557)
>     at org.apache.kafka.streams.integration.StoreQueryIntegrationTest.shouldQuerySpecificActivePartitionStores(StoreQueryIntegrationTest.java:183)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>     at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>     at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>     at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>     at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>     at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>     at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>     at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
>     at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     at java.base/java.lang.Thread.run(Thread.java:833)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Resolved] (KAFKA-13613) Kafka Connect has a hard dependency on KeyGenerator.HmacSHA256
[ https://issues.apache.org/jira/browse/KAFKA-13613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mickael Maison resolved KAFKA-13613.
------------------------------------
    Fix Version/s: 3.3.0
       Resolution: Fixed

> Kafka Connect has a hard dependency on KeyGenerator.HmacSHA256
> --------------------------------------------------------------
>
>              Key: KAFKA-13613
>              URL: https://issues.apache.org/jira/browse/KAFKA-13613
>          Project: Kafka
>       Issue Type: Bug
>       Components: KafkaConnect
> Affects Versions: 3.0.0
>      Environment: RHEL 8.5
>                   OpenJDK 1.8.0_312 or 11
>                   Confluent Platform 7.0.1 (Kafka 3.0.0)
>         Reporter: Guy Pascarella
>         Assignee: Chris Egerton
>         Priority: Major
>          Fix For: 3.3.0
>
> If a server is running Java 8 that has been configured for FIPS mode according to [openjdk-8-configuring_openjdk_8_on_rhel_with_fips-en-us.pdf|https://access.redhat.com/documentation/en-us/openjdk/8/pdf/configuring_openjdk_8_on_rhel_with_fips/openjdk-8-configuring_openjdk_8_on_rhel_with_fips-en-us.pdf] then the SunJCE provider is not available. As such the KeyGenerator HmacSHA256 is not available. The KeyGenerators I see available are:
> * DES
> * ARCFOUR
> * AES
> * DESede
>
> Out of these I think AES would be most appropriate, but that's not the point of this issue, just including for completeness.
>
> When Kafka Connect is started in distributed mode on one of these servers I see the following stack trace:
> {noformat}
> [2022-01-20 20:36:30,027] ERROR Stopping due to error (org.apache.kafka.connect.cli.ConnectDistributed)
> java.lang.ExceptionInInitializerError
>     at org.apache.kafka.connect.cli.ConnectDistributed.startConnect(ConnectDistributed.java:94)
>     at org.apache.kafka.connect.cli.ConnectDistributed.main(ConnectDistributed.java:79)
> Caused by: org.apache.kafka.common.config.ConfigException: Invalid value HmacSHA256 for configuration inter.worker.key.generation.algorithm: HmacSHA256 KeyGenerator not available
>     at org.apache.kafka.connect.runtime.distributed.DistributedConfig.validateKeyAlgorithm(DistributedConfig.java:504)
>     at org.apache.kafka.connect.runtime.distributed.DistributedConfig.lambda$configDef$2(DistributedConfig.java:375)
>     at org.apache.kafka.common.config.ConfigDef$LambdaValidator.ensureValid(ConfigDef.java:1043)
>     at org.apache.kafka.common.config.ConfigDef$ConfigKey.<init>(ConfigDef.java:1164)
>     at org.apache.kafka.common.config.ConfigDef.define(ConfigDef.java:152)
>     at org.apache.kafka.common.config.ConfigDef.define(ConfigDef.java:172)
>     at org.apache.kafka.common.config.ConfigDef.define(ConfigDef.java:211)
>     at org.apache.kafka.common.config.ConfigDef.define(ConfigDef.java:373)
>     at org.apache.kafka.connect.runtime.distributed.DistributedConfig.configDef(DistributedConfig.java:371)
>     at org.apache.kafka.connect.runtime.distributed.DistributedConfig.<init>(DistributedConfig.java:196)
>     ... 2 more
> {noformat}
> It appears the {{org.apache.kafka.connect.runtime.distributed.DistributedConfig}} is triggering a validation of the hard-coded default {{inter.worker.key.generation.algorithm}} property, which is {{HmacSHA256}}. Ideally a fix would use the value from the configuration file before attempting to validate a default value.
>
> Updates [2022/01/27]: I just tested on a FIPS-enabled version of OpenJDK 11 using the instructions at [configuring_openjdk_11_on_rhel_with_fips|https://access.redhat.com/documentation/en-us/openjdk/11/html-single/configuring_openjdk_11_on_rhel_with_fips/index], which resulted in the same issues. One workaround is to disable FIPS for Kafka Connect by passing in the JVM parameter {{-Dcom.redhat.fips=false}}, however, that means Kafka Connect and all the workers are out of compliance for anyone required to use FIPS-enabled systems.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
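The root cause above is that the JVM's configured providers simply don't offer the HmacSHA256 KeyGenerator. Which key-generation algorithms a given JVM actually supports (e.g. before choosing a value for inter.worker.key.generation.algorithm on a FIPS host) can be probed with the standard javax.crypto API; a small sketch, with an illustrative helper name:

```java
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import javax.crypto.KeyGenerator;

// Probe which KeyGenerator algorithms the current JVM's security providers
// offer. On a FIPS-configured JDK without SunJCE, "HmacSHA256" may be
// missing, which is exactly what trips DistributedConfig's validation in
// the ticket above. "KeyGenProbe" is a hypothetical helper, not Kafka code.
class KeyGenProbe {
    static List<String> availableAlgorithms(String... candidates) {
        List<String> available = new ArrayList<>();
        for (String alg : candidates) {
            try {
                KeyGenerator.getInstance(alg); // throws if no provider supports alg
                available.add(alg);
            } catch (NoSuchAlgorithmException e) {
                // algorithm not available under the configured providers
            }
        }
        return available;
    }
}
```

On the reporter's FIPS hosts this probe would list DES, ARCFOUR, AES, and DESede but not HmacSHA256; on a stock JDK, HmacSHA256 is present.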
[jira] [Created] (KAFKA-14046) Improve kafka doc about "emit.checkpoints.interval.seconds" description, together with "sync.group.offsets.interval.seconds"
Justinwins created KAFKA-14046:
----------------------------------

         Summary: Improve kafka doc about "emit.checkpoints.interval.seconds" description, together with "sync.group.offsets.interval.seconds"
             Key: KAFKA-14046
             URL: https://issues.apache.org/jira/browse/KAFKA-14046
         Project: Kafka
      Issue Type: Task
        Reporter: Justinwins
     Attachments: image-2022-07-05-17-06-02-384.png

The MirrorCheckpointTask thread will block for "emit.checkpoints.interval.seconds" before it re-calculates the group offsets to sync, as shown in the attached image.

Let's say emit.checkpoints.interval.seconds=60 and sync.group.offsets.interval.seconds=30. Then the group offsets to sync will be refreshed every 60s, while these group offsets will be committed to the target cluster TWICE before they are refreshed. The second commit is redundant.

So I think it's better to point out how the two parameters should 'coordinate'.

!image-2022-07-05-17-06-02-384.png!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
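Following the ticket's reasoning, one way to avoid the redundant second commit is to keep the offset-sync interval at least as long as the checkpoint-emit interval, so each refreshed checkpoint is committed roughly once. A hedged MirrorMaker 2 configuration fragment (the property names are real MM2 settings; the values are illustrative, and whether this is the right trade-off depends on how fresh you need the synced offsets to be):

```properties
# MirrorMaker 2 checkpoint/offset-sync settings (illustrative values).
# Checkpoints are re-calculated every 60s...
emit.checkpoints.interval.seconds=60
# ...so committing translated offsets to the target cluster more often than
# that just re-commits the same offsets. Keeping the sync interval >= the
# emit interval avoids the redundant second commit described in this ticket.
sync.group.offsets.interval.seconds=60
sync.group.offsets.enabled=true
```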
[jira] [Resolved] (KAFKA-14044) Upgrade Netty and Jackson for CVE fixes
[ https://issues.apache.org/jira/browse/KAFKA-14044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luke Chen resolved KAFKA-14044.
-------------------------------
    Fix Version/s: 3.3.0
       Resolution: Fixed

> Upgrade Netty and Jackson for CVE fixes
> ---------------------------------------
>
>              Key: KAFKA-14044
>              URL: https://issues.apache.org/jira/browse/KAFKA-14044
>          Project: Kafka
>       Issue Type: Bug
>       Components: core
> Affects Versions: 3.2.0
>         Reporter: Thomas Cooper
>         Assignee: Thomas Cooper
>         Priority: Minor
>           Labels: security
>          Fix For: 3.3.0
>
> There are a couple of CVEs for Netty and Jackson:
> Netty: [CVE-2022-24823|https://www.cve.org/CVERecord?id=CVE-2022-24823] - fixed by upgrading to 4.1.77+
> Jackson: [CVE-2020-36518|https://www.cve.org/CVERecord?id=CVE-2020-36518] - fixed by upgrading to 2.13.0+

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
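For projects consuming these libraries transitively, an upgrade like this can be sketched as Gradle dependency constraints. The Maven coordinates are the libraries' real ones, the minimum versions are those named in the ticket, and the fragment is illustrative rather than Kafka's actual build script:

```groovy
// Illustrative Gradle fragment (not Kafka's build files): pin Netty and
// Jackson at or above the releases the ticket says fix the two CVEs.
dependencies {
    constraints {
        implementation('io.netty:netty-handler:4.1.77.Final') {
            because 'CVE-2022-24823 is fixed in Netty 4.1.77+'
        }
        implementation('com.fasterxml.jackson.core:jackson-databind:2.13.0') {
            because 'CVE-2020-36518 is fixed in Jackson 2.13.0+ per the ticket'
        }
    }
}
```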
[jira] [Created] (KAFKA-14045) Heartbeat threads cause high CPU usage after broker shut down
gdd sop created KAFKA-14045:
-------------------------------

         Summary: Heartbeat threads cause high CPU usage after broker shut down
             Key: KAFKA-14045
             URL: https://issues.apache.org/jira/browse/KAFKA-14045
         Project: Kafka
      Issue Type: Bug
        Reporter: gdd sop

1. Shut down all brokers.
2. Run the top command; the Kafka coordinator's CPU usage is high.

Here is the flame graph collected by Arthas:
(flame graph attachment: inline blob image, not recoverable from the archive)

The issue goes away after restarting the brokers.

Maybe it should wait before the pollNoWakeup call:
https://github.com/apache/kafka/blob/2.8/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L1372

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
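The reporter's suggestion is essentially to add a backoff so the heartbeat loop doesn't hot-spin on a poll that returns immediately while all brokers are down. A generic sketch of that pattern (not the actual AbstractCoordinator code or the eventual Kafka fix):

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Generic busy-loop-with-backoff sketch illustrating the suggestion above:
// instead of spinning on a poll that immediately returns while brokers are
// unreachable, sleep a bounded interval between attempts.
class BackoffPoller {
    // Polls `condition` up to maxAttempts times, sleeping backoffMs between
    // attempts; returns the number of attempts made.
    static int pollUntil(BooleanSupplier condition, long backoffMs, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            if (condition.getAsBoolean()) return attempts; // success
            try {
                TimeUnit.MILLISECONDS.sleep(backoffMs);    // avoid hot spin
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();        // preserve interrupt status
                return attempts;
            }
        }
        return attempts; // gave up after maxAttempts
    }
}
```

Without the sleep, the same loop would spin as fast as the condition check allows, which is the high-CPU symptom in the flame graph.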
[jira] [Resolved] (KAFKA-13897) Add 3.1.1 to system tests and streams upgrade tests
[ https://issues.apache.org/jira/browse/KAFKA-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christo Lolov resolved KAFKA-13897.
-----------------------------------
    Resolution: Won't Fix

> Add 3.1.1 to system tests and streams upgrade tests
> ---------------------------------------------------
>
>             Key: KAFKA-13897
>             URL: https://issues.apache.org/jira/browse/KAFKA-13897
>         Project: Kafka
>      Issue Type: Task
>      Components: streams, system tests
>        Reporter: Tom Bentley
>        Priority: Blocker
>         Fix For: 3.3.0, 3.1.2, 3.2.1
>
> Per the penultimate bullet on the [release checklist|https://cwiki.apache.org/confluence/display/KAFKA/Release+Process#ReleaseProcess-Afterthevotepasses], Kafka v3.1.1 is released. We should add this version to the system tests.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)