Request for filing KIP

2024-05-20 Thread Harsh Panchal
Dear Apache Kafka Team,

As instructed, I would like to write a KIP for PR
https://github.com/apache/kafka/pull/15905.

I see that I don't have access to the "Create KIP" button on Confluence. I
kindly request access so that I can write up the KIP. My user name is: bootmgr

Best Regards,
Harsh Panchal


[jira] [Resolved] (KAFKA-16760) alterReplicaLogDirs failed even if responded with none error

2024-05-20 Thread Luke Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Chen resolved KAFKA-16760.
---
Resolution: Not A Problem

> alterReplicaLogDirs failed even if responded with none error
> 
>
> Key: KAFKA-16760
> URL: https://issues.apache.org/jira/browse/KAFKA-16760
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Luke Chen
>Priority: Major
>
> When firing an alterReplicaLogDirs request, the response comes back with error 
> NONE. But in fact, the alterLogDir never happened, and these errors show up:
> {code:java}
> [2024-05-14 16:48:50,796] INFO [ReplicaAlterLogDirsThread-1]: Partition 
> topicB-0 has an older epoch (0) than the current leader. Will await the new 
> LeaderAndIsr state before resuming fetching. 
> (kafka.server.ReplicaAlterLogDirsThread:66)
> [2024-05-14 16:48:50,796] WARN [ReplicaAlterLogDirsThread-1]: Partition 
> topicB-0 marked as failed (kafka.server.ReplicaAlterLogDirsThread:70)
> {code}
> Note: This is under KRaft mode, so the log message mentioning LeaderAndIsr is wrong. 
> This can be reproduced in this 
> [branch|https://github.com/showuon/kafka/tree/alterLogDirTest] and running 
> this test:
> {code:java}
> ./gradlew cleanTest storage:test --tests 
> org.apache.kafka.tiered.storage.integration.AlterLogDirTest
> {code}
> The complete logs can be found here: 
> https://gist.github.com/showuon/b16cdb05a125a7c445cc6e377a2b7923
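>
> For reference, a rough sketch of how such an alterReplicaLogDirs call can be 
> issued from the Java Admin client; the broker id, topic, and target log dir 
> below are placeholders, not taken from the test above:
> {code:java}
> // Illustration only: request moving topicB-0 on broker 1 to another log dir,
> // then wait on the per-replica future instead of relying on the top-level error.
> import java.util.Map;
> import java.util.Properties;
>
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.common.TopicPartitionReplica;
>
> public class AlterLogDirExample {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
>         try (Admin admin = Admin.create(props)) {
>             TopicPartitionReplica replica = new TopicPartitionReplica("topicB", 0, 1);
>             admin.alterReplicaLogDirs(Map.of(replica, "/tmp/kafka-logs-2"))
>                  .values()
>                  .get(replica)
>                  .get(); // completes only once the movement request is accepted
>         }
>     }
> }
> {code}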





requesting permissions to contribute to Apache Kafka.

2024-05-20 Thread jiang dou
Hello,

My wiki ID is 'bmilk' and my Jira ID is 'bmilk'.
Thank you


Re: [DISCUSS] KIP-1027 Add MockFixedKeyProcessorContext

2024-05-20 Thread Matthias J. Sax
Had a discussion on https://issues.apache.org/jira/browse/KAFKA-15242 
and it was pointed out that we also need to do something about 
`FixedKeyRecord`. It does not have a public constructor (which is 
correct; it should not have one). However, this makes testing 
`FixedKeyProcessor` impossible w/o extending `FixedKeyRecord` manually, 
which does not seem right (too clumsy).


It seems we either need some helper builder method (though it's not clear to me 
where to add it in an elegant way) which would provide us with a 
`FixedKeyRecord`, or add some sub-class to the test-utils module which 
would extend `FixedKeyRecord` -- or maybe there is an even better solution? I 
could not think of anything else so far.
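
For concreteness, this is the kind of processor that is affected; it compiles 
against the public API, but with no way to construct a `FixedKeyRecord` from 
test code, its `process()` method cannot be exercised in a unit test today 
(the class below is just an illustrative example, not something from the KIP):

import org.apache.kafka.streams.processor.api.FixedKeyProcessor;
import org.apache.kafka.streams.processor.api.FixedKeyProcessorContext;
import org.apache.kafka.streams.processor.api.FixedKeyRecord;

public class UppercaseValueProcessor implements FixedKeyProcessor<String, String, String> {
    private FixedKeyProcessorContext<String, String> context;

    @Override
    public void init(final FixedKeyProcessorContext<String, String> context) {
        this.context = context;
    }

    @Override
    public void process(final FixedKeyRecord<String, String> record) {
        // The key must stay fixed; only the value is transformed and forwarded.
        context.forward(record.withValue(record.value().toUpperCase()));
    }
}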



Thoughts?


On 5/3/24 9:46 AM, Matthias J. Sax wrote:

Please also update the KIP.

To get a wiki account created, please request it via a comment on this 
ticket: https://issues.apache.org/jira/browse/INFRA-25451


After you have the account, please share your wiki id, and we can give 
you write permission on the wiki.




-Matthias

On 5/3/24 6:30 AM, Shashwat Pandey wrote:

Hi Matthias,

Sorry, this fell off my radar for a bit.

Revisiting the topic, I think you’re right and we accept the duplicated
nesting as an appropriate solution to not affect the larger public API.

I can update my PR with the change.

Regards,
Shashwat Pandey


On Wed, May 1, 2024 at 11:00 PM Matthias J. Sax  wrote:


Any updates on this KIP?

On 3/28/24 4:11 AM, Matthias J. Sax wrote:
It seems that `MockRecordMetadata` is a private class, and thus not part
of the public API. If there are any changes required, we don't need to
discuss them in the KIP.


For `CapturedPunctuator` and `CapturedForward` it's a little bit more
tricky. My gut feeling is that the classes might not need to be
changed, but if we use them within `MockProcessorContext` and
`MockFixedKeyProcessorContext` it might be weird to keep the current
nesting... The problem I see is that it's not straightforward how to
move the classes w/o breaking compatibility, nor how to duplicate them as
standalone classes w/o a larger "splash radius". (We would need to add
new overloads for MockProcessorContext#scheduledPunctuators() and
MockProcessorContext#forwarded().)

Might be good to hear from others whether we think it's worth these larger
changes to get rid of the nesting, or whether we just accept the somewhat
non-ideal nesting, as it technically is not a real issue?


-Matthias


On 3/15/24 1:47 AM, Shashwat Pandey wrote:

Thanks for the feedback Matthias!

The reason I proposed the extension of MockProcessorContext was more to do
with the internals of the class (MockRecordMetadata, CapturedPunctuator and
CapturedForward).

However, I do see your point. If we then split MockProcessorContext and
MockFixedKeyProcessorContext, some of the internal classes should also be
extracted, i.e. MockRecordMetadata, CapturedPunctuator, and probably a new
CapturedFixedKeyForward.

Let me know what you think!


Regards,
Shashwat Pandey


On Mon, Mar 11, 2024 at 10:09 PM Matthias J. Sax 
wrote:

Thanks for the KIP Shashwat. Closing this testing gap is great! It did
come up a few times already...

One question: why do you propose to `extend MockProcessorContext`?

Given how the actual runtime context classes are set up, it seems that
the regular context and fixed-key-context are distinct, and thus I
believe both mock-context classes should be distinct, too?

What I mean is that FixedKeyProcessorContext does not extend
ProcessorContext. Both classes have a common parent ProcessINGContext
(note the very similar but different names), but they are "siblings"
only, so why give the mock contexts a parent-child relationship?

It seems better to do

public class MockFixedKeyProcessorContext<KForward, VForward>
    implements FixedKeyProcessorContext<KForward, VForward>,
               RecordCollector.Supplier


Of course, if there is code we can share between both mock contexts we 
should do this, but it should not leak into the public API.
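
(To illustrate the sibling relationship with a compilable snippet -- this is 
just simplified usage of the existing public interfaces, not a proposal: 
shared helper code can be written against the common parent 
`ProcessingContext`, while anything that forwards records needs the specific 
sibling type.)

import org.apache.kafka.streams.processor.api.FixedKeyProcessorContext;
import org.apache.kafka.streams.processor.api.ProcessingContext;
import org.apache.kafka.streams.processor.api.ProcessorContext;

public class ContextHierarchyExample {
    // Works for both context types, since both extend ProcessingContext:
    static String describe(final ProcessingContext context) {
        return context.applicationId() + "/" + context.taskId();
    }

    // Forwarding records requires the concrete sibling interface:
    static void useRegular(final ProcessorContext<String, String> context) { }
    static void useFixedKey(final FixedKeyProcessorContext<String, String> context) { }
}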


-Matthias



On 3/11/24 5:21 PM, Shashwat Pandey wrote:

Hi everyone,

I would like to start the discussion on




https://cwiki.apache.org/confluence/display/KAFKA/KIP-1027%3A+Add+MockFixedKeyProcessorContext


This adds MockFixedKeyProcessorContext to the Kafka Streams Test Utils
library.

Regards,
Shashwat Pandey











Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #2921

2024-05-20 Thread Apache Jenkins Server
See 




[jira] [Resolved] (KAFKA-16197) Connect Worker poll timeout prints Consumer poll timeout specific warnings.

2024-05-20 Thread Greg Harris (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Harris resolved KAFKA-16197.
-
Fix Version/s: 3.8.0
   Resolution: Fixed

> Connect Worker poll timeout prints Consumer poll timeout specific warnings.
> ---
>
> Key: KAFKA-16197
> URL: https://issues.apache.org/jira/browse/KAFKA-16197
> Project: Kafka
>  Issue Type: Bug
>  Components: connect
>Reporter: Sagar Rao
>Assignee: Sagar Rao
>Priority: Major
> Fix For: 3.8.0
>
>
> When a Connect worker's poll timeout expires, the log lines that 
> we see are:
> {noformat}
> consumer poll timeout has expired. This means the time between subsequent 
> calls to poll() was longer than the configured max.poll.interval.ms, which 
> typically implies that the poll loop is spending too much time processing 
> messages. You can address this either by increasing max.poll.interval.ms or 
> by reducing the maximum size of batches returned in poll() with 
> max.poll.records.
> {noformat}
> and the reason for leaving the group is 
> {noformat}
> Member XX sending LeaveGroup request to coordinator XX due to consumer poll 
> timeout has expired.
> {noformat}
> which is specific to Consumers and not to Connect workers. The log line above 
> is especially misleading because the config `max.poll.interval.ms` is not 
> configurable for a Connect worker, and it could make someone believe that the 
> logs are being written for Sink Connectors rather than for the Connect worker. 
> Ideally, we should print something specific to Connect.





[jira] [Created] (KAFKA-16806) Explicitly declare JUnit dependencies for all test modules

2024-05-20 Thread Greg Harris (Jira)
Greg Harris created KAFKA-16806:
---

 Summary: Explicitly declare JUnit dependencies for all test modules
 Key: KAFKA-16806
 URL: https://issues.apache.org/jira/browse/KAFKA-16806
 Project: Kafka
  Issue Type: Sub-task
Reporter: Greg Harris


The automatic loading of test framework implementation dependencies has been 
deprecated.    
This is scheduled to be removed in Gradle 9.0.    
Declare the desired test framework directly on the test suite or explicitly 
declare the test framework implementation dependencies on the test's runtime 
classpath.    
[Documentation|https://docs.gradle.org/8.7/userguide/upgrading_version_8.html#test_framework_implementation_dependencies]





[jira] [Created] (KAFKA-16805) Stop using a ClosureBackedAction to configure Spotbugs reports

2024-05-20 Thread Greg Harris (Jira)
Greg Harris created KAFKA-16805:
---

 Summary: Stop using a ClosureBackedAction to configure Spotbugs 
reports
 Key: KAFKA-16805
 URL: https://issues.apache.org/jira/browse/KAFKA-16805
 Project: Kafka
  Issue Type: Sub-task
Reporter: Greg Harris


The org.gradle.util.ClosureBackedAction type has been deprecated.    
This is scheduled to be removed in Gradle 9.0.    
[Documentation|https://docs.gradle.org/8.7/userguide/upgrading_version_7.html#org_gradle_util_reports_deprecations]
    
1 usage    
[https://github.com/apache/kafka/blob/95adb7bfbfc69b3e9f3538cc5d6f7c6a577d30ee/build.gradle#L745-L749]
 



 





[jira] [Created] (KAFKA-16804) Replace gradle archivesBaseName with archivesName

2024-05-20 Thread Greg Harris (Jira)
Greg Harris created KAFKA-16804:
---

 Summary: Replace gradle archivesBaseName with archivesName
 Key: KAFKA-16804
 URL: https://issues.apache.org/jira/browse/KAFKA-16804
 Project: Kafka
  Issue Type: Sub-task
Reporter: Greg Harris


The BasePluginExtension.archivesBaseName property has been deprecated.    
This is scheduled to be removed in Gradle 9.0.    
Please use the archivesName property instead.    
[Documentation|https://docs.gradle.org/8.7/dsl/org.gradle.api.plugins.BasePluginExtension.html#org.gradle.api.plugins.BasePluginExtension:archivesName]
1 usage    
Script:build.gradle





[jira] [Created] (KAFKA-16803) Upgrade to a version of ShadowJavaPlugin which doesn't use ConfigureUtil

2024-05-20 Thread Greg Harris (Jira)
Greg Harris created KAFKA-16803:
---

 Summary: Upgrade to a version of ShadowJavaPlugin which doesn't 
use ConfigureUtil
 Key: KAFKA-16803
 URL: https://issues.apache.org/jira/browse/KAFKA-16803
 Project: Kafka
  Issue Type: Sub-task
Reporter: Greg Harris


The org.gradle.util.ConfigureUtil type has been deprecated.    
This is scheduled to be removed in Gradle 9.0.    
[Documentation|https://docs.gradle.org/8.7/userguide/upgrading_version_8.html#org_gradle_util_reports_deprecations]
2 usages    
Plugin:com.github.jengelman.gradle.plugins.shadow.ShadowJavaPlugin    
Plugin:com.github.jengelman.gradle.plugins.shadow.ShadowJavaPlugin





[jira] [Created] (KAFKA-16802) Move build.gradle java version information inside of a java block

2024-05-20 Thread Greg Harris (Jira)
Greg Harris created KAFKA-16802:
---

 Summary: Move build.gradle java version information inside of a 
java block
 Key: KAFKA-16802
 URL: https://issues.apache.org/jira/browse/KAFKA-16802
 Project: Kafka
  Issue Type: Sub-task
Reporter: Greg Harris


The org.gradle.api.plugins.JavaPluginConvention type has been deprecated.    
This is scheduled to be removed in Gradle 9.0.    
[Documentation|https://docs.gradle.org/8.7/userguide/upgrading_version_8.html#java_convention_deprecation]
  
[https://github.com/apache/kafka/blob/95adb7bfbfc69b3e9f3538cc5d6f7c6a577d30ee/build.gradle#L292-L295]





[jira] [Created] (KAFKA-16801) Streams upgrade :test target doesn't find any junit tests

2024-05-20 Thread Greg Harris (Jira)
Greg Harris created KAFKA-16801:
---

 Summary: Streams upgrade :test target doesn't find any junit tests
 Key: KAFKA-16801
 URL: https://issues.apache.org/jira/browse/KAFKA-16801
 Project: Kafka
  Issue Type: Sub-task
  Components: streams
Reporter: Greg Harris


No test executed. This behavior has been deprecated.    
This will fail with an error in Gradle 9.0.    
There are test sources present but no test was executed. Please check your test 
configuration.    
[Documentation|https://docs.gradle.org/8.7/userguide/upgrading_version_8.html#test_task_fail_on_no_test_executed]
    
23 usages

 
Task::streams:upgrade-system-tests-0100:test    
Task::streams:upgrade-system-tests-0101:test    
Task::streams:upgrade-system-tests-0102:test    
Task::streams:upgrade-system-tests-0110:test    
Task::streams:upgrade-system-tests-10:test    
Task::streams:upgrade-system-tests-11:test    
Task::streams:upgrade-system-tests-20:test    
Task::streams:upgrade-system-tests-21:test    
Task::streams:upgrade-system-tests-22:test    
Task::streams:upgrade-system-tests-23:test    
Task::streams:upgrade-system-tests-24:test    
Task::streams:upgrade-system-tests-25:test    
Task::streams:upgrade-system-tests-26:test    
Task::streams:upgrade-system-tests-27:test    
Task::streams:upgrade-system-tests-28:test    
Task::streams:upgrade-system-tests-30:test    
Task::streams:upgrade-system-tests-31:test    
Task::streams:upgrade-system-tests-32:test    
Task::streams:upgrade-system-tests-33:test    
Task::streams:upgrade-system-tests-34:test    
Task::streams:upgrade-system-tests-35:test    
Task::streams:upgrade-system-tests-36:test    
Task::streams:upgrade-system-tests-37:test





[jira] [Created] (KAFKA-16800) Resolve Gradle 9.0 deprecations

2024-05-20 Thread Greg Harris (Jira)
Greg Harris created KAFKA-16800:
---

 Summary: Resolve Gradle 9.0 deprecations
 Key: KAFKA-16800
 URL: https://issues.apache.org/jira/browse/KAFKA-16800
 Project: Kafka
  Issue Type: Task
Reporter: Greg Harris


Gradle prints the following warning in our build:
{noformat}
Deprecated Gradle features were used in this build, making it incompatible with 
Gradle 9.0.{noformat}
We should try to resolve these build warnings to prepare for the future release 
of Gradle 9.





[jira] [Created] (KAFKA-16799) NetworkClientDelegate is not backing off if the node is not found

2024-05-20 Thread Philip Nee (Jira)
Philip Nee created KAFKA-16799:
--

 Summary: NetworkClientDelegate is not backing off if the node is 
not found
 Key: KAFKA-16799
 URL: https://issues.apache.org/jira/browse/KAFKA-16799
 Project: Kafka
  Issue Type: Bug
  Components: consumer
Reporter: Philip Nee
Assignee: Philip Nee


When performing stress testing, I found that AsyncKafkaConsumer's network 
client delegate isn't backing off if the node is not ready, causing a large 
number of: 
{code:java}
 358 [2024-05-20 22:59:02,591] DEBUG [Consumer 
clientId=consumer.7136899e-0c20-4ccb-8ba3-497e9e683594-0, 
groupId=consumer-groups-test-5] Node is not ready, handle the request in the 
next event loop: node=b4-pkc-devcmkz697.us-west-2.aws.devel.cpdev.cloud:9092 
(id: 2147483643 rack: null), 
request=UnsentRequest{requestBuilder=ConsumerGroupHeartbeatRequestData(groupId='consumer-groups-test-5', 
memberId='', memberEpoch=0, instanceId=null, rackId=null, 
rebalanceTimeoutMs=10, subscribedTopicNames=[_kengine-565-test-topic8081], 
serverAssignor=null, topicPartitions=[]), 
handler=org.apache.kafka.clients.consumer.internals.NetworkClientDelegate$FutureCompletionHandler@139a8761, 
node=Optional[b4-pkc-devcmkz697.us-west-2.aws.devel.cpdev.cloud:9092 (id: 
2147483643 rack: null)], timer=org.apache.kafka.common.utils.Timer@649fffad} 
(org.apache.kafka.clients.consumer.internals.NetworkClientDelegate:169) {code}
show up in the log.

What should have happened is: 1. node is not ready, 2. exponential backoff, 3. 
retry.
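
As an illustration only (not the NetworkClientDelegate code): the expected 
behavior, sketched with the existing ExponentialBackoff utility from 
org.apache.kafka.common.utils; the intervals below are placeholders.
{code:java}
import org.apache.kafka.common.utils.ExponentialBackoff;

public class NodeNotReadyBackoffSketch {
    public static void main(String[] args) throws InterruptedException {
        ExponentialBackoff backoff = new ExponentialBackoff(
            100L,   // initial backoff in ms (placeholder)
            2,      // multiplier
            1000L,  // max backoff in ms (placeholder)
            0.2);   // jitter

        for (int attempt = 0; attempt < 5; attempt++) {
            boolean nodeReady = false; // stand-in for a "node is ready" check
            if (nodeReady) {
                break;                 // 3. retry the request once the node is ready
            }
            // 1. node is not ready -> 2. wait an exponentially increasing interval
            Thread.sleep(backoff.backoff(attempt));
        }
    }
}
{code}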





[jira] [Resolved] (KAFKA-16645) CVEs in 3.7.0 docker image

2024-05-20 Thread Igor Soarez (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Soarez resolved KAFKA-16645.
-
Resolution: Resolved

> CVEs in 3.7.0 docker image
> --
>
> Key: KAFKA-16645
> URL: https://issues.apache.org/jira/browse/KAFKA-16645
> Project: Kafka
>  Issue Type: Task
>Affects Versions: 3.7.0
>Reporter: Mickael Maison
>Assignee: Igor Soarez
>Priority: Blocker
> Fix For: 3.8.0, 3.7.1
>
>
> Our [Docker Image CVE 
> Scanner|https://github.com/apache/kafka/actions/runs/874393] GitHub 
> action reports 2 high CVEs in our base image:
> apache/kafka:3.7.0 (alpine 3.19.1)
> ==
> Total: 2 (HIGH: 2, CRITICAL: 0)
> Library  | Vulnerability  | Severity | Status | Installed Version | Fixed Version | Title
> ---------+----------------+----------+--------+-------------------+---------------+-------------------------------------------------------------
> libexpat | CVE-2023-52425 | HIGH     | fixed  | 2.5.0-r2          | 2.6.0-r0      | expat: parsing large tokens can trigger a denial of service
>          |                |          |        |                   |               | https://avd.aquasec.com/nvd/cve-2023-52425
> libexpat | CVE-2024-28757 | HIGH     | fixed  | 2.5.0-r2          | 2.6.2-r0      | expat: XML Entity Expansion
>          |                |          |        |                   |               | https://avd.aquasec.com/nvd/cve-2024-28757
> Looking at the 
> [KIP|https://cwiki.apache.org/confluence/display/KAFKA/KIP-975%3A+Docker+Image+for+Apache+Kafka#KIP975:DockerImageforApacheKafka-WhatifweobserveabugoracriticalCVEinthereleasedApacheKafkaDockerImage?]
>  that introduced the docker images, it seems we should release a bugfix when 
> high CVEs are detected. It would be good to investigate and assess whether 
> Kafka is impacted or not.





[jira] [Reopened] (KAFKA-16645) CVEs in 3.7.0 docker image

2024-05-20 Thread Igor Soarez (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Soarez reopened KAFKA-16645:
-

Need to re-open to change the resolution, release_notes.py doesn't like the one 
I picked

> CVEs in 3.7.0 docker image
> --
>
> Key: KAFKA-16645
> URL: https://issues.apache.org/jira/browse/KAFKA-16645
> Project: Kafka
>  Issue Type: Task
>Affects Versions: 3.7.0
>Reporter: Mickael Maison
>Assignee: Igor Soarez
>Priority: Blocker
> Fix For: 3.8.0, 3.7.1
>
>
> Our [Docker Image CVE 
> Scanner|https://github.com/apache/kafka/actions/runs/874393] GitHub 
> action reports 2 high CVEs in our base image:
> apache/kafka:3.7.0 (alpine 3.19.1)
> ==
> Total: 2 (HIGH: 2, CRITICAL: 0)
> Library  | Vulnerability  | Severity | Status | Installed Version | Fixed Version | Title
> ---------+----------------+----------+--------+-------------------+---------------+-------------------------------------------------------------
> libexpat | CVE-2023-52425 | HIGH     | fixed  | 2.5.0-r2          | 2.6.0-r0      | expat: parsing large tokens can trigger a denial of service
>          |                |          |        |                   |               | https://avd.aquasec.com/nvd/cve-2023-52425
> libexpat | CVE-2024-28757 | HIGH     | fixed  | 2.5.0-r2          | 2.6.2-r0      | expat: XML Entity Expansion
>          |                |          |        |                   |               | https://avd.aquasec.com/nvd/cve-2024-28757
> Looking at the 
> [KIP|https://cwiki.apache.org/confluence/display/KAFKA/KIP-975%3A+Docker+Image+for+Apache+Kafka#KIP975:DockerImageforApacheKafka-WhatifweobserveabugoracriticalCVEinthereleasedApacheKafkaDockerImage?]
>  that introduced the docker images, it seems we should release a bugfix when 
> high CVEs are detected. It would be good to investigate and assess whether 
> Kafka is impacted or not.





DescribeLogDirs in Kafka 3.3.1 returns all topics instead of one provided in the request. Bug or "bad user error"?

2024-05-20 Thread Maxim Senin
Hello.

I’m having a problem with the Kafka protocol API.

Requests:
DescribeLogDirs Request (Version: 0) => [topics]
  topics => topic [partitions]
topic => STRING
partitions => INT32

My request contains `[{topic: “blah”, partitions: [0,1,2,3,4,5,6,7,8,9]}]`, but 
the result

Responses:
DescribeLogDirs Response (Version: 0) => throttle_time_ms [results]
  throttle_time_ms => INT32
  results => error_code log_dir [topics]
error_code => INT16
log_dir => STRING
topics => name [partitions]
  name => STRING
  partitions => partition_index partition_size offset_lag is_future_key
partition_index => INT32
partition_size => INT64
offset_lag => INT64
is_future_key => BOOLEAN



 contains entries for *all* topics. My workaround has been to filter the 
returned list by topic name to find the one I was requesting the data for, but 
I don’t understand why it’s not limiting the results to just the topic I 
requested in the first place.
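
For what it's worth, the same filtering workaround via the Java Admin client 
looks roughly like this (broker id 0 and topic "blah" are placeholders):

import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.LogDirDescription;

public class DescribeLogDirsFilterExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // DescribeLogDirs is keyed by broker, and the response covers every
            // topic in each log dir, so the caller filters by topic afterwards.
            Map<Integer, Map<String, LogDirDescription>> byBroker =
                admin.describeLogDirs(List.of(0)).allDescriptions().get();
            byBroker.get(0).forEach((logDir, description) ->
                description.replicaInfos().forEach((partition, replicaInfo) -> {
                    if (partition.topic().equals("blah")) {
                        System.out.println(logDir + " " + partition + " size=" + replicaInfo.size());
                    }
                }));
        }
    }
}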

Also, I think there should be an option to just specify ALL_PARTITIONS because 
that would save me from having to retrieve topic metadata from the broker to 
count the number of partitions. The Kafka server probably has the means to do 
that more efficiently.

Is this a bug or am I doing something wrong?

Thanks,
Maxim



COGILITY SOFTWARE CORPORATION LEGAL DISCLAIMER: The information in this email 
is confidential and is intended solely for the addressee. Access to this email 
by anyone else is unauthorized. If you are not the intended recipient, any 
disclosure, copying, distribution or any action taken or omitted to be taken in 
reliance on it, is prohibited and may be unlawful.


Re: [DISCUSS] KIP-1045 Decide MockAdminClient to move to public api or not

2024-05-20 Thread Muralidhar Basani
Hi all,

Any thoughts on this? In my view, it helps developers have an end-to-end
testing framework embedded in their applications, right from mocking the
creation of topics and ACLs.
Even though creation and listing of these are usually done at design time,
automating them could be beneficial.

MockProducer and MockConsumer are already part of the public API.
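
For comparison, this is roughly how the already-public MockProducer is used in 
a unit test today (the topic and values below are placeholders); MockAdminClient 
would give Admin-based code the same kind of in-memory test double:

import org.apache.kafka.clients.producer.MockProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class MockProducerExample {
    public static void main(String[] args) {
        MockProducer<String, String> producer =
            new MockProducer<>(true, new StringSerializer(), new StringSerializer());

        // The code under test sends through the Producer interface as usual...
        producer.send(new ProducerRecord<>("orders", "key-1", "value-1"));

        // ...and the test asserts on what was "sent", without a real broker.
        assert producer.history().size() == 1;
        assert producer.history().get(0).topic().equals("orders");
    }
}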

Thanks,
Murali

On Thu, May 16, 2024 at 11:10 AM Muralidhar Basani <
muralidhar.bas...@aiven.io> wrote:

> Hello,
>
> As part of this KIP, I would like to get your opinions on
> moving MockAdminClient to the src folder and making it public: would it be
> beneficial or not.
>
> KIP-1045
> 
>
> Currently MockConsumer and MockProducer are part of the public API. They
> are useful for developers wanting to test their applications. On the other
> hand MockAdminClient is not part of the public API (it's under test). We
> should consider moving it to src so users can also easily test applications
> that depend on Admin. (Mentioned by Mickael Maison in KAFKA-15258
> )
>
> Thanks,
> Murali
>
> Muralidhar Basani
> Staff Software Engineer, Aiven
> muralidhar.bas...@aiven.io
>
>


[DISCUSS] KIP-655: Add deduplication processor in kafka-streams

2024-05-20 Thread Ayoub
Hello,

Following a discussion on the community Slack channel, I would like to revive
the discussion on KIP-655, which is about adding a deduplication
processor to Kafka Streams.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-655%3A+Windowed+Distinct+Operation+for+Kafka+Streams+API

Even though the motivation is not quite the same as the initial one, I
updated the KIP rather than creating a new one, as I believe the end goal
is the same.
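
To make the idea concrete, a rough sketch of the kind of processor the KIP 
describes, assuming key-based deduplication within a fixed time window backed 
by a window store; store registration, serdes, and the DSL operator itself are 
omitted, and the class below is illustrative rather than the KIP's actual API:

import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.WindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

public class DeduplicationProcessor<K, V> implements Processor<K, V, K, V> {
    private final String storeName;
    private final long windowMs;
    private ProcessorContext<K, V> context;
    private WindowStore<K, Long> seen;

    public DeduplicationProcessor(final String storeName, final long windowMs) {
        this.storeName = storeName;
        this.windowMs = windowMs;
    }

    @Override
    public void init(final ProcessorContext<K, V> context) {
        this.context = context;
        this.seen = context.getStateStore(storeName);
    }

    @Override
    public void process(final Record<K, V> record) {
        final long now = record.timestamp();
        // Drop the record if the same key was already seen within the window.
        try (final WindowStoreIterator<Long> hits = seen.fetch(record.key(), now - windowMs, now)) {
            if (hits.hasNext()) {
                return;
            }
        }
        seen.put(record.key(), now, now);
        context.forward(record);
    }
}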

Thanks,
Ayoub


[jira] [Resolved] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6

2024-05-20 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan resolved KAFKA-16692.

Fix Version/s: 3.6.3
   Resolution: Fixed

> InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not 
> enabled when upgrading from kafka 3.5 to 3.6 
> 
>
> Key: KAFKA-16692
> URL: https://issues.apache.org/jira/browse/KAFKA-16692
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 3.7.0, 3.6.1, 3.8
>Reporter: Johnson Okorie
>Assignee: Justine Olshan
>Priority: Major
> Fix For: 3.7.1, 3.6.3, 3.8
>
>
> We have a Kafka cluster running on version 3.5.2 that we are upgrading to 
> 3.6.1. This cluster has a lot of clients with exactly-once semantics enabled 
> and hence creating transactions. As we replaced brokers with the new 
> binaries, we observed lots of clients in the cluster experiencing the 
> following error:
> {code:java}
> 2024-05-07T09:08:10.039Z "tid": "" -- [Producer clientId=, 
> transactionalId=] Got error produce response with 
> correlation id 6402937 on topic-partition , retrying 
> (2147483512 attempts left). Error: NETWORK_EXCEPTION. Error Message: The 
> server disconnected before a response was received.{code}
> On inspecting the broker, we saw the following errors on brokers still 
> running Kafka version 3.5.2:
>  
> {code:java}
> message:     
> Closing socket for  because of error
> exception_exception_class:    
> org.apache.kafka.common.errors.InvalidRequestException
> exception_exception_message:    
> Received request api key ADD_PARTITIONS_TO_TXN with version 4 which is not 
> enabled
> exception_stacktrace:    
> org.apache.kafka.common.errors.InvalidRequestException: Received request api 
> key ADD_PARTITIONS_TO_TXN with version 4 which is not enabled
> {code}
> On the new brokers running 3.6.1 we saw the following errors:
>  
> {code:java}
> [AddPartitionsToTxnSenderThread-1055]: AddPartitionsToTxnRequest failed for 
> node 1043 with a network exception.{code}
>  
> I can also see this :
> {code:java}
> [AddPartitionsToTxnManager broker=1055]Cancelled in-flight 
> ADD_PARTITIONS_TO_TXN request with correlation id 21120 due to node 1043 
> being disconnected (elapsed time since creation: 11ms, elapsed time since 
> send: 4ms, request timeout: 3ms){code}
> We started investigating this issue and digging through the changes in 3.6, 
> we came across some changes introduced as part of KAFKA-14402 that we thought 
> might lead to this behaviour. 
> First, we could see that _transaction.partition.verification.enable_ is 
> enabled by default and enables a new code path that culminates in us sending 
> version 4 ADD_PARTITIONS_TO_TXN requests to other brokers that are generated 
> [here|https://github.com/apache/kafka/blob/29f3260a9c07e654a28620aeb93a778622a5233d/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L269].
> From a 
> [discussion|https://lists.apache.org/thread/4895wrd1z92kjb708zck4s1f62xq6r8x] 
> on the mailing list, [~jolshan] pointed out that this scenario shouldn't be 
> possible as the following code paths should prevent version 4 
> ADD_PARTITIONS_TO_TXN requests being sent to other brokers:
> [https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L130]
>  
> [https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L195]
> However, these requests are still sent to other brokers in our environment.
> On further inspection of the code, I am wondering if the following code path 
> could lead to this issue:
> [https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L500]
> In this scenario, we don't have any _NodeApiVersions_ available for the 
> specified nodeId and potentially skipping the _latestUsableVersion_ check. I 
> am wondering if it is possible that because _discoverBrokerVersions_ is set 
> to _false_ for the network client of the {_}AddPartitionsToTxnManager{_}, it 
> skips fetching {_}NodeApiVersions{_}? I can see that we create the network 
> client here:
> [https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/core/src/main/scala/kafka/server/KafkaServer.scala#L641]
> The _NetworkUtils.buildNetworkClient_ method seems to create a network client 
> that has _discoverBrokerVersions_ set to {_}false{_}. 
> I was hoping I could get some assistance debugging this issue. Happy to 
> provide any additional information needed.
>  
>  
>  




Re: [DISCUSS] KIP-1044: A proposal to change idempotent producer -- server implementation

2024-05-20 Thread Justine Olshan
My team has looked at it from a high level, but we haven't had the time to
come up with a full proposal.

I'm not aware if others have worked on it.

Justine

On Mon, May 20, 2024 at 10:21 AM Omnia Ibrahim 
wrote:

> Hi Justine are you aware of anyone looking into such new protocol at the
> moment?
>
> > On 20 May 2024, at 18:00, Justine Olshan 
> wrote:
> >
> > I would say I have first hand knowledge of this issue as someone who
> > responds to such incidents as part of my work at Confluent over the past
> > couple years. :)
> >
> >> We only persist the information for the length of time we retain
> > snapshots.
> > This seems a bit contradictory to me. We are going to persist
> (potentially)
> > useless information if we have no signal if the producer is still active.
> > This is the problem we have with old clients. We are always going to have
> > to draw the line for how long we allow a producer to have a gap in
> > producing vs how long we allow filling up with short-lived producers that
> > risk OOM.
> >
> > With an LRU cache, we run into the same problem, as we will expire all
> > "well-behaved" infrequent producers that last produced before the burst
> of
> > short-lived clients. The benefit is that we don't have a solid line in
> the
> > sand and we only expire when we need to, but we will still risk expiring
> > active producers.
> >
> > I am willing to discuss some solutions that work with older clients, but
> my
> > concern is spending too much time on a complicated solution and not
> > encouraging movement to newer and better clients.
> >
> > Justine
> >
> > On Mon, May 20, 2024 at 9:35 AM Claude Warren  wrote:
> >
> >>>
> >>> Why should we persist useless information
> >>> for clients that are long gone and will never use it?
> >>
> >>
> >> We are not.  We only persist the information for the length of time we
> >> retain snapshots.   The change here is to make the snapshots work as longer
> >> term storage for infrequent producers and others who would be negatively
> >> affected by some of the solutions proposed.
> >>
> >> Your changes require changes in the clients.   Older clients will not be
> >> able to participate.  My change does not require client change.
> >> There are issues outside of the ones discussed.  I was told of this late
> >> last week.  I will endeavor to find someone with first hand knowledge of
> >> the issue and have them report on this thread.
> >>
> >> In addition, the use of an LRU amortizes the cache cleanup so we don't
> need
> >> a thread to expire things.  You still have the cache, the point is that
> it
> >> really is a cache, there is storage behind it.  Let the cache be a
> cache,
> >> let the snapshots be the storage backing behind the cache.
> >>
> >> On Fri, May 17, 2024 at 5:26 PM Justine Olshan
> >> 
> >> wrote:
> >>
> >>> Respectfully, I don't agree. Why should we persist useless information
> >>> for clients that are long gone and will never use it?
> >>> This is why I'm suggesting we do something smarter when it comes to
> >> storing
> >>> data and only store data we actually need and have a use for.
> >>>
> >>> This is why I suggest the heartbeat. It gives us clear information (up
> to
> >>> the heartbeat interval) of which producers are worth keeping and which
> >> that
> >>> are not.
> >>> I'm not in favor of building a new and complicated system to try to
> guess
> >>> which information is needed. In my mind, if we have a ton of
> legitimately
> >>> active producers, we should scale up memory. If we don't there is no
> >> reason
> >>> to have high memory usage.
> >>>
> >>> Fixing the client also allows us to fix some of the other issues we
> have
> >>> with idempotent producers.
> >>>
> >>> Justine
> >>>
> >>> On Fri, May 17, 2024 at 12:46 AM Claude Warren 
> wrote:
> >>>
>  I think that the point here is that the design that assumes that you
> >> can
>  keep all the PIDs in memory for all server configurations and all
> >> usages
>  and all client implementations is fraught with danger.
> 
>  Yes, there are solutions already in place (KIP-854) that attempt to
> >>> address
>  this problem, and other proposed solutions to remove that have
> >>> undesirable
>  side effects (e.g. Heartbeat interrupted by IP failure for a slow
> >>> producer
>  with a long delay between posts).  KAFKA-16229 (Slow expiration of
> >>> Producer
>  IDs leading to high CPU usage) dealt with how to expire data from the
> >>> cache
>  so that there was minimal lag time.
> 
>  But the net issue is still the underlying design/architecture.
> 
>  There are a  couple of salient points here:
> 
>    - The state of a state machine is only a view on its transactions.
> >>> This
>    is the classic stream / table dichotomy.
>    - What the "cache" is trying to do is create that view.
>    - In some cases the size of the state exceeds the storage of the
> >> cache
>    and the systems fail.
>   

Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #2920

2024-05-20 Thread Apache Jenkins Server
See 




Re: [DISCUSS] KIP-1044: A proposal to change idempotent producer -- server implementation

2024-05-20 Thread Omnia Ibrahim
Hi Justine are you aware of anyone looking into such new protocol at the moment?

> On 20 May 2024, at 18:00, Justine Olshan  wrote:
> 
> I would say I have first hand knowledge of this issue as someone who
> responds to such incidents as part of my work at Confluent over the past
> couple years. :)
> 
>> We only persist the information for the length of time we retain
> snapshots.
> This seems a bit contradictory to me. We are going to persist (potentially)
> useless information if we have no signal if the producer is still active.
> This is the problem we have with old clients. We are always going to have
> to draw the line for how long we allow a producer to have a gap in
> producing vs how long we allow filling up with short-lived producers that
> risk OOM.
> 
> With an LRU cache, we run into the same problem, as we will expire all
> "well-behaved" infrequent producers that last produced before the burst of
> short-lived clients. The benefit is that we don't have a solid line in the
> sand and we only expire when we need to, but we will still risk expiring
> active producers.
> 
> I am willing to discuss some solutions that work with older clients, but my
> concern is spending too much time on a complicated solution and not
> encouraging movement to newer and better clients.
> 
> Justine
> 
> On Mon, May 20, 2024 at 9:35 AM Claude Warren  wrote:
> 
>>> 
>>> Why should we persist useless information
>>> for clients that are long gone and will never use it?
>> 
>> 
>> We are not.  We only persist the information for the length of time we
>> retain snapshots.   The change here is to make the snapshots work as longer
>> term storage for infrequent producers and others who would be negatively
>> affected by some of the solutions proposed.
>> 
>> Your changes require changes in the clients.   Older clients will not be
>> able to participate.  My change does not require client change.
>> There are issues outside of the ones discussed.  I was told of this late
>> last week.  I will endeavor to find someone with first hand knowledge of
>> the issue and have them report on this thread.
>> 
>> In addition, the use of an LRU amortizes the cache cleanup so we don't need
>> a thread to expire things.  You still have the cache, the point is that it
>> really is a cache, there is storage behind it.  Let the cache be a cache,
>> let the snapshots be the storage backing behind the cache.
>> 
>> On Fri, May 17, 2024 at 5:26 PM Justine Olshan
>> 
>> wrote:
>> 
>>> Respectfully, I don't agree. Why should we persist useless information
>>> for clients that are long gone and will never use it?
>>> This is why I'm suggesting we do something smarter when it comes to
>> storing
>>> data and only store data we actually need and have a use for.
>>> 
>>> This is why I suggest the heartbeat. It gives us clear information (up to
>>> the heartbeat interval) of which producers are worth keeping and which
>> that
>>> are not.
>>> I'm not in favor of building a new and complicated system to try to guess
>>> which information is needed. In my mind, if we have a ton of legitimately
>>> active producers, we should scale up memory. If we don't there is no
>> reason
>>> to have high memory usage.
>>> 
>>> Fixing the client also allows us to fix some of the other issues we have
>>> with idempotent producers.
>>> 
>>> Justine
>>> 
>>> On Fri, May 17, 2024 at 12:46 AM Claude Warren  wrote:
>>> 
 I think that the point here is that the design that assumes that you
>> can
 keep all the PIDs in memory for all server configurations and all
>> usages
 and all client implementations is fraught with danger.
 
 Yes, there are solutions already in place (KIP-854) that attempt to
>>> address
 this problem, and other proposed solutions to remove that have
>>> undesirable
 side effects (e.g. Heartbeat interrupted by IP failure for a slow
>>> producer
 with a long delay between posts).  KAFKA-16229 (Slow expiration of
>>> Producer
 IDs leading to high CPU usage) dealt with how to expire data from the
>>> cache
 so that there was minimal lag time.
 
 But the net issue is still the underlying design/architecture.
 
 There are a  couple of salient points here:
 
   - The state of a state machine is only a view on its transactions.
>>> This
   is the classic stream / table dichotomy.
   - What the "cache" is trying to do is create that view.
   - In some cases the size of the state exceeds the storage of the
>> cache
   and the systems fail.
   - The current solutions have attempted to place limits on the size
>> of
   the state.
   - Errors in implementation and or configuration will eventually lead
>>> to
   "problem producers"
   - Under the adopted fixes and current slate of proposals, the
>> "problem
   producers" solutions have cascading side effects on properly behaved
   producers. (e.g. dropping long running, slow producing producers)
 

Re: [DISCUSS] KIP-1044: A proposal to change idempotent producer -- server implementation

2024-05-20 Thread Justine Olshan
I would say I have first hand knowledge of this issue as someone who
responds to such incidents as part of my work at Confluent over the past
couple years. :)

> We only persist the information for the length of time we retain
snapshots.
This seems a bit contradictory to me. We are going to persist (potentially)
useless information if we have no signal if the producer is still active.
This is the problem we have with old clients. We are always going to have
to draw the line for how long we allow a producer to have a gap in
producing vs how long we allow filling up with short-lived producers that
risk OOM.

With an LRU cache, we run into the same problem, as we will expire all
"well-behaved" infrequent producers that last produced before the burst of
short-lived clients. The benefit is that we don't have a solid line in the
sand and we only expire when we need to, but we will still risk expiring
active producers.

I am willing to discuss some solutions that work with older clients, but my
concern is spending too much time on a complicated solution and not
encouraging movement to newer and better clients.

Justine

On Mon, May 20, 2024 at 9:35 AM Claude Warren  wrote:

> >
> >  Why should we persist useless information
> > for clients that are long gone and will never use it?
>
>
> We are not.  We only persist the information for the length of time we
> retain snapshots.   The change here is to make the snapshots work as longer
> term storage for infrequent producers and others who would be negatively
> affected by some of the solutions proposed.
>
> Your changes require changes in the clients.   Older clients will not be
> able to participate.  My change does not require client change.
> There are issues outside of the ones discussed.  I was told of this late
> last week.  I will endeavor to find someone with first hand knowledge of
> the issue and have them report on this thread.
>
> In addition, the use of an LRU amortizes the cache cleanup so we don't need
> a thread to expire things.  You still have the cache, the point is that it
> really is a cache, there is storage behind it.  Let the cache be a cache,
> let the snapshots be the storage backing behind the cache.
>
> On Fri, May 17, 2024 at 5:26 PM Justine Olshan
> 
> wrote:
>
> > Respectfully, I don't agree. Why should we persist useless information
> > for clients that are long gone and will never use it?
> > This is why I'm suggesting we do something smarter when it comes to
> storing
> > data and only store data we actually need and have a use for.
> >
> > This is why I suggest the heartbeat. It gives us clear information (up to
> > the heartbeat interval) of which producers are worth keeping and which
> that
> > are not.
> > I'm not in favor of building a new and complicated system to try to guess
> > which information is needed. In my mind, if we have a ton of legitimately
> > active producers, we should scale up memory. If we don't there is no
> reason
> > to have high memory usage.
> >
> > Fixing the client also allows us to fix some of the other issues we have
> > with idempotent producers.
> >
> > Justine
> >
> > On Fri, May 17, 2024 at 12:46 AM Claude Warren  wrote:
> >
> > > I think that the point here is that the design that assumes that you
> can
> > > keep all the PIDs in memory for all server configurations and all
> usages
> > > and all client implementations is fraught with danger.
> > >
> > > Yes, there are solutions already in place (KIP-854) that attempt to
> > address
> > > this problem, and other proposed solutions to remove that have
> > undesirable
> > > side effects (e.g. Heartbeat interrupted by IP failure for a slow
> > producer
> > > with a long delay between posts).  KAFKA-16229 (Slow expiration of
> > Producer
> > > IDs leading to high CPU usage) dealt with how to expire data from the
> > cache
> > > so that there was minimal lag time.
> > >
> > > But the net issue is still the underlying design/architecture.
> > >
> > > There are a  couple of salient points here:
> > >
> > >- The state of a state machine is only a view on its transactions.
> > This
> > >is the classic stream / table dichotomy.
> > >- What the "cache" is trying to do is create that view.
> > >- In some cases the size of the state exceeds the storage of the
> cache
> > >and the systems fail.
> > >- The current solutions have attempted to place limits on the size
> of
> > >the state.
> > >- Errors in implementation and or configuration will eventually lead
> > to
> > >"problem producers"
> > >- Under the adopted fixes and current slate of proposals, the
> "problem
> > >producers" solutions have cascading side effects on properly behaved
> > >producers. (e.g. dropping long running, slow producing producers)
> > >
> > > For decades (at least since the 1980's and anecdotally since the
> 1960's)
> > > there has been a solution to processing state where the size of the
> state
> > > exceeded the memory 

Re: [DISCUSS] KIP-1044: A proposal to change idempotent producer -- server implementation

2024-05-20 Thread Claude Warren
>
>  Why should we persist useless information
> for clients that are long gone and will never use it?


We are not.  We only persist the information for the length of time we
retain snapshots.   The change here is to make the snapshots work as longer
term storage for infrequent producers and others who would be negatively
affected by some of the solutions proposed.

Your changes require changes in the clients.   Older clients will not be
able to participate.  My change does not require client change.
There are issues outside of the ones discussed.  I was told of this late
last week.  I will endeavor to find someone with first hand knowledge of
the issue and have them report on this thread.

In addition, the use of an LRU amortizes the cache cleanup so we don't need
a thread to expire things.  You still have the cache, the point is that it
really is a cache, there is storage behind it.  Let the cache be a cache,
let the snapshots be the storage backing behind the cache.
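
To make the shape of that concrete, a toy sketch and nothing more -- the 
SnapshotStore interface below is hypothetical and just stands in for the 
on-disk producer snapshot files:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class PidCache<K, V> {
    interface SnapshotStore<K, V> {
        void persist(K key, V value);
        Optional<V> load(K key);
    }

    private final SnapshotStore<K, V> store;
    private final LinkedHashMap<K, V> lru;

    PidCache(final int maxEntries, final SnapshotStore<K, V> store) {
        this.store = store;
        this.lru = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(final Map.Entry<K, V> eldest) {
                if (size() > maxEntries) {
                    // Spill evicted producer state to the snapshot store, don't drop it.
                    store.persist(eldest.getKey(), eldest.getValue());
                    return true;
                }
                return false;
            }
        };
    }

    void put(final K pid, final V state) {
        lru.put(pid, state);
    }

    Optional<V> get(final K pid) {
        V cached = lru.get(pid);
        if (cached != null) {
            return Optional.of(cached);
        }
        // Cache miss: fall back to the snapshots as the source of truth.
        Optional<V> fromDisk = store.load(pid);
        fromDisk.ifPresent(v -> lru.put(pid, v));
        return fromDisk;
    }
}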

On Fri, May 17, 2024 at 5:26 PM Justine Olshan 
wrote:

> Respectfully, I don't agree. Why should we persist useless information
> for clients that are long gone and will never use it?
> This is why I'm suggesting we do something smarter when it comes to storing
> data and only store data we actually need and have a use for.
>
> This is why I suggest the heartbeat. It gives us clear information (up to
> the heartbeat interval) of which producers are worth keeping and which that
> are not.
> I'm not in favor of building a new and complicated system to try to guess
> which information is needed. In my mind, if we have a ton of legitimately
> active producers, we should scale up memory. If we don't there is no reason
> to have high memory usage.
>
> Fixing the client also allows us to fix some of the other issues we have
> with idempotent producers.
>
> Justine
>
> On Fri, May 17, 2024 at 12:46 AM Claude Warren  wrote:
>
> > I think that the point here is that the design that assumes that you can
> > keep all the PIDs in memory for all server configurations and all usages
> > and all client implementations is fraught with danger.
> >
> > Yes, there are solutions already in place (KIP-854) that attempt to
> address
> > this problem, and other proposed solutions to remove that have
> undesirable
> > side effects (e.g. Heartbeat interrupted by IP failure for a slow
> producer
> > with a long delay between posts).  KAFKA-16229 (Slow expiration of
> Producer
> > IDs leading to high CPU usage) dealt with how to expire data from the
> cache
> > so that there was minimal lag time.
> >
> > But the net issue is still the underlying design/architecture.
> >
> > There are a  couple of salient points here:
> >
> >- The state of a state machine is only a view on its transactions.
> This
> >is the classic stream / table dichotomy.
> >- What the "cache" is trying to do is create that view.
> >- In some cases the size of the state exceeds the storage of the cache
> >and the systems fail.
> >- The current solutions have attempted to place limits on the size of
> >the state.
> >- Errors in implementation and or configuration will eventually lead
> to
> >"problem producers"
> >- Under the adopted fixes and current slate of proposals, the "problem
> >producers" solutions have cascading side effects on properly behaved
> >producers. (e.g. dropping long running, slow producing producers)
> >
> > For decades (at least since the 1980's and anecdotally since the 1960's)
> > there has been a solution to processing state where the size of the state
> > exceeded the memory available.  It is the solution that drove the idea
> that
> > you could have tables in Kafka.  The idea that we can store the hot PIDs
> in
> > memory using an LRU and write data to storage so that we can quickly find
> > things not in the cache is not new.  It has been proven.
> >
> > I am arguing that we should not throw away state data because we are
> > running out of memory.  We should persist that data to disk and consider
> > the disk as the source of truth for state.
> >
> > Claude
> >
> >
> > On Wed, May 15, 2024 at 7:42 PM Justine Olshan
> > 
> > wrote:
> >
> > > +1 to the comment.
> > >
> > > > I still feel we are doing all of this only because of a few
> > anti-pattern
> > > or misconfigured producers and not because we have “too many Producer”.
> > I
> > > believe that implementing Producer heartbeat and remove short-lived
> PIDs
> > > from the cache if we didn’t receive heartbeat will be more simpler and
> > step
> > > on right direction  to improve idempotent logic and maybe try to make
> PID
> > > get reused between session which will implement a real idempotent
> > producer
> > > instead of idempotent session.  I admit this wouldn’t help with old
> > clients
> > > but it will put us on the right path.
> > >
> > > This issue is very complicated and I appreciate the attention on it.
> > > Hopefully we can find a good solution working 

Re: [VOTE] KIP-1038: Add Custom Error Handler to Producer

2024-05-20 Thread Kirk True
+1 (non-binding)

Thanks Alieh!

> On May 20, 2024, at 6:00 AM, Walker Carlson  
> wrote:
> 
> Hey Alieh,
> 
> Thanks for the KIP.
> 
> +1 binding
> 
> Walker
> 
> On Tue, May 7, 2024 at 10:57 AM Alieh Saeedi 
> wrote:
> 
>> Hi all,
>> 
>> It seems that we have no more comments, discussions, or feedback on
>> KIP-1038; therefore, I’d like to open voting for the KIP: Add Custom Error
>> Handler to Producer
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1038%3A+Add+Custom+Error+Handler+to+Producer
>>> 
>> 
>> 
>> Cheers,
>> Alieh
>> 



Re: [DISCUSS] KIP-1044: A proposal to change idempotent producer -- server implementation

2024-05-20 Thread Omnia Ibrahim
I agree with Justine, especially considering that the number of producers on a 
Kafka cluster is usually very limited. 
This makes me think we are focusing on the symptom in my KIP-936 and this KIP, 
which is the memory issue, instead of addressing the root cause.
The root cause is that the idempotent session protocol tolerates and allows 
short-lived session metadata (PID and state) to crowd the cluster.

So, similar to Justine, I am more in favour of considering a new protocol that 
offers true idempotent producer capabilities across sessions, that can also 
truly identify the producer client.
However, if we can address the memory issue in the meantime with a simple 
solution to protect against the anti-patterns/misconfigured use-cases of the 
idempotent session protocol, 
this would be a win until we come up with a new protocol.
So far, both KIP-936 and this KIP propose somewhat complicated solutions to 
this symptom.

Maybe we should divide the focus into:
	• Finding a simple way to protect against OOM caused by short-lived 
idempotent sessions. This might be a bit complicated, as identifying short-lived 
producers is tricky with the current protocol. 
	For instance, we could revisit the rejected alternative in KIP-936 to 
throttle INIT_PID requests (a toy sketch of such a throttle follows after this 
list). It is not perfect, but it is the simplest.
	• Developing Idempotent Protocol V2 that addresses this issue and the 
client-side issues with idempotent producers. @Justine, are you aware of anyone 
looking into such a protocol in detail at the moment? 
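
(A toy sketch of the kind of INIT_PID throttle meant above -- hypothetical, 
not KIP-936's actual design: a token bucket per client principal, with 
placeholder rates.)

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class InitPidThrottle {
    private static final class Bucket {
        double tokens;
        long lastRefillMs;
        Bucket(final double tokens, final long now) {
            this.tokens = tokens;
            this.lastRefillMs = now;
        }
    }

    private final double maxTokens;        // allowed burst of InitProducerId requests
    private final double refillPerSecond;  // sustained rate allowed per principal
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    InitPidThrottle(final double maxTokens, final double refillPerSecond) {
        this.maxTokens = maxTokens;
        this.refillPerSecond = refillPerSecond;
    }

    /** Returns true if the InitProducerId request should be accepted, false if throttled. */
    synchronized boolean tryAcquire(final String principal, final long nowMs) {
        Bucket b = buckets.computeIfAbsent(principal, p -> new Bucket(maxTokens, nowMs));
        b.tokens = Math.min(maxTokens, b.tokens + (nowMs - b.lastRefillMs) / 1000.0 * refillPerSecond);
        b.lastRefillMs = nowMs;
        if (b.tokens >= 1.0) {
            b.tokens -= 1.0;
            return true;
        }
        return false;
    }
}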

Omnia

> On 17 May 2024, at 16:26, Justine Olshan  wrote:
> 
> Respectfully, I don't agree. Why should we persist useless information
> for clients that are long gone and will never use it?
> This is why I'm suggesting we do something smarter when it comes to storing
> data and only store data we actually need and have a use for.
> 
> This is why I suggest the heartbeat. It gives us clear information (up to
> the heartbeat interval) of which producers are worth keeping and which that
> are not.
> I'm not in favor of building a new and complicated system to try to guess
> which information is needed. In my mind, if we have a ton of legitimately
> active producers, we should scale up memory. If we don't there is no reason
> to have high memory usage.
> 
> Fixing the client also allows us to fix some of the other issues we have
> with idempotent producers.
> 
> Justine
> 
> On Fri, May 17, 2024 at 12:46 AM Claude Warren  wrote:
> 
>> I think that the point here is that the design that assumes that you can
>> keep all the PIDs in memory for all server configurations and all usages
>> and all client implementations is fraught with danger.
>> 
>> Yes, there are solutions already in place (KIP-854) that attempt to address
>> this problem, and other proposed solutions to remove that have undesirable
>> side effects (e.g. Heartbeat interrupted by IP failure for a slow producer
>> with a long delay between posts).  KAFKA-16229 (Slow expiration of Producer
>> IDs leading to high CPU usage) dealt with how to expire data from the cache
>> so that there was minimal lag time.
>> 
>> But the net issue is still the underlying design/architecture.
>> 
>> There are a  couple of salient points here:
>> 
>>   - The state of a state machine is only a view on its transactions.  This
>>   is the classic stream / table dichotomy.
>>   - What the "cache" is trying to do is create that view.
>>   - In some cases the size of the state exceeds the storage of the cache
>>   and the systems fail.
>>   - The current solutions have attempted to place limits on the size of
>>   the state.
>>   - Errors in implementation and or configuration will eventually lead to
>>   "problem producers"
>>   - Under the adopted fixes and current slate of proposals, the "problem
>>   producers" solutions have cascading side effects on properly behaved
>>   producers. (e.g. dropping long running, slow producing producers)
>> 
>> For decades (at least since the 1980's and anecdotally since the 1960's)
>> there has been a solution to processing state where the size of the state
>> exceeded the memory available.  It is the solution that drove the idea that
>> you could have tables in Kafka.  The idea that we can store the hot PIDs in
>> memory using an LRU and write data to storage so that we can quickly find
>> things not in the cache is not new.  It has been proven.
>> 
>> I am arguing that we should not throw away state data because we are
>> running out of memory.  We should persist that data to disk and consider
>> the disk as the source of truth for state.
>> 
>> Claude
>> 
>> 
>> On Wed, May 15, 2024 at 7:42 PM Justine Olshan
>> 
>> wrote:
>> 
>>> +1 to the comment.
>>> 
 I still feel we are doing all of this only because of a few
>> anti-pattern
>>> or misconfigured producers and not because we have “too many Producer”.
>> I
>>> believe that implementing Producer heartbeat and remove short-lived PIDs
>>> from the cache if we didn’t receive 

[jira] [Reopened] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6

2024-05-20 Thread Igor Soarez (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Soarez reopened KAFKA-16692:
-

Re-opening as 3.6 backport is still missing

> InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not 
> enabled when upgrading from kafka 3.5 to 3.6 
> 
>
> Key: KAFKA-16692
> URL: https://issues.apache.org/jira/browse/KAFKA-16692
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 3.7.0, 3.6.1, 3.8
>Reporter: Johnson Okorie
>Assignee: Justine Olshan
>Priority: Major
> Fix For: 3.7.1, 3.8
>
>
> We have a Kafka cluster running on version 3.5.2 that we are upgrading to 
> 3.6.1. This cluster has a lot of clients with exactly-once semantics enabled 
> and hence creating transactions. As we replaced brokers with the new 
> binaries, we observed lots of clients in the cluster experiencing the 
> following error:
> {code:java}
> 2024-05-07T09:08:10.039Z "tid": "" -- [Producer clientId=, 
> transactionalId=] Got error produce response with 
> correlation id 6402937 on topic-partition , retrying 
> (2147483512 attempts left). Error: NETWORK_EXCEPTION. Error Message: The 
> server disconnected before a response was received.{code}
> On inspecting the broker, we saw the following errors on brokers still 
> running Kafka version 3.5.2:
>  
> {code:java}
> message:     
> Closing socket for  because of error
> exception_exception_class:    
> org.apache.kafka.common.errors.InvalidRequestException
> exception_exception_message:    
> Received request api key ADD_PARTITIONS_TO_TXN with version 4 which is not 
> enabled
> exception_stacktrace:    
> org.apache.kafka.common.errors.InvalidRequestException: Received request api 
> key ADD_PARTITIONS_TO_TXN with version 4 which is not enabled
> {code}
> On the new brokers running 3.6.1 we saw the following errors:
>  
> {code:java}
> [AddPartitionsToTxnSenderThread-1055]: AddPartitionsToTxnRequest failed for 
> node 1043 with a network exception.{code}
>  
> I can also see this :
> {code:java}
> [AddPartitionsToTxnManager broker=1055]Cancelled in-flight 
> ADD_PARTITIONS_TO_TXN request with correlation id 21120 due to node 1043 
> being disconnected (elapsed time since creation: 11ms, elapsed time since 
> send: 4ms, request timeout: 3ms){code}
> While investigating this issue and digging through the changes in 3.6, 
> we came across some changes introduced as part of KAFKA-14402 that we thought 
> might lead to this behaviour. 
> First we could see that _transaction.partition.verification.enable_ is 
> enabled by default and enables a new code path that culminates in sending 
> version 4 ADD_PARTITIONS_TO_TXN requests to other brokers that are generated 
> [here|https://github.com/apache/kafka/blob/29f3260a9c07e654a28620aeb93a778622a5233d/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L269].
> From a 
> [discussion|https://lists.apache.org/thread/4895wrd1z92kjb708zck4s1f62xq6r8x] 
> on the mailing list, [~jolshan] pointed out that this scenario shouldn't be 
> possible as the following code paths should prevent version 4 
> ADD_PARTITIONS_TO_TXN requests being sent to other brokers:
> [https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L130]
>  
> [https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L195]
> However, these requests are still sent to other brokers in our environment.
> On further inspection of the code, I am wondering if the following code path 
> could lead to this issue:
> [https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L500]
> In this scenario, we don't have any _NodeApiVersions_ available for the 
> specified nodeId and potentially skipping the _latestUsableVersion_ check. I 
> am wondering if it is possible that because _discoverBrokerVersions_ is set 
> to _false_ for the network client of the {_}AddPartitionsToTxnManager{_}, it 
> skips fetching {_}NodeApiVersions{_}? I can see that we create the network 
> client here:
> [https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/core/src/main/scala/kafka/server/KafkaServer.scala#L641]
> The _NetworkUtils.buildNetworkClient_ method seems to create a network client 
> that has _discoverBrokerVersions_ set to {_}false{_}. 
> I was hoping I could get some assistance debugging this issue. Happy to 
> provide any additional information needed.
>  
>  
>  



--
This message was sent by Atlassian Jira

Re: [PR] Add Skillsoft to "Powered By" [kafka-site]

2024-05-20 Thread via GitHub


brandon-powers commented on PR #601:
URL: https://github.com/apache/kafka-site/pull/601#issuecomment-2120763191

   Hi team  - when someone has a moment to take a look, feel free to ping me 
if there's anything else required.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [DISCUSS] KIP-1031: Control offset translation in MirrorSourceConnector

2024-05-20 Thread Chia-Ping Tsai
Nice KIP. Some minor comments/questions are listed below.

1) It seems we can disable the sync of idle consumers by setting 
`sync.group.offsets.interval.seconds` to -1, so the fail-fast should include 
sync.group.offsets.interval.seconds too. Also, maybe we should add the 
fail-fast to MirrorCheckpointConnector even without this KIP.

2) Should we do a similar fail-fast for MirrorSourceConnector if the user sets 
custom producer configs with emit.offset-syncs.enabled=false? I assume the 
producer that sends records to the offset-syncs topic won't be created if 
emit.offset-syncs.enabled=false.

3) Should we simplify the SourceRecord if emit.offset-syncs.enabled=false? 
Maybe that could yield a small performance improvement.

Best,
Chia-Ping
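
To make the fail-fast in points 1) and 2) concrete, a preflight check along the lines of the Connector::validate approach mentioned in the quoted thread below could look roughly like this. The config keys follow the KIP discussion, but the surrounding class and the exact validation wiring are assumptions for illustration, not the actual MirrorMaker 2 implementation.

{code:java}
import java.util.Map;
import org.apache.kafka.common.config.Config;
import org.apache.kafka.common.config.ConfigValue;
import org.apache.kafka.connect.connector.Connector;

// Sketch only: report a preflight error when checkpoint features are enabled
// while offset syncs are disabled.
public abstract class CheckpointPreflightSketch extends Connector {

    @Override
    public Config validate(Map<String, String> connectorConfigs) {
        Config config = super.validate(connectorConfigs);

        boolean syncsEnabled = !"false".equalsIgnoreCase(
                connectorConfigs.getOrDefault("emit.offset-syncs.enabled", "true"));
        boolean checkpointsEnabled = !"false".equalsIgnoreCase(
                connectorConfigs.getOrDefault("emit.checkpoints.enabled", "true"));
        boolean groupOffsetsEnabled = "true".equalsIgnoreCase(
                connectorConfigs.getOrDefault("sync.group.offsets.enabled", "false"));

        if (!syncsEnabled && (checkpointsEnabled || groupOffsetsEnabled)) {
            for (ConfigValue value : config.configValues()) {
                if (value.name().equals("emit.offset-syncs.enabled")) {
                    value.addErrorMessage("emit.checkpoints.enabled and "
                            + "sync.group.offsets.enabled require offset syncs; "
                            + "either enable emit.offset-syncs.enabled or disable them.");
                }
            }
        }
        return config;
    }
}
{code}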

On 2024/04/08 10:03:50 Omnia Ibrahim wrote:
> Hi Chris, 
> Validation method is a good call. I updated the KIP to state that the 
> checkpoint connector will fail if the configs aren’t correct. And updated the 
> description of the new config to explain the impact of it on checkpoint 
> connector as well. 
> 
> If there is no other feedback from anyone, I would like to start the 
> voting thread in a few days. 
> Thanks 
> Omnia
> 
> > On 8 Apr 2024, at 06:31, Chris Egerton  wrote:
> > 
> > Hi Omnia,
> > 
> > Ah, good catch. I think failing to start the checkpoint connector if offset
> > syncs are disabled is fine. We'd probably want to do that via the
> > Connector::validate [1] method in order to be able to catch invalid configs
> > during preflight validation, but it's not necessary to get that specific in
> > the KIP (especially since we may add other checks as well).
> > 
> > [1] -
> > https://kafka.apache.org/37/javadoc/org/apache/kafka/connect/connector/Connector.html#validate(java.util.Map)
> > 
> > Cheers,
> > 
> > Chris
> > 
> > On Thu, Apr 4, 2024 at 8:07 PM Omnia Ibrahim 
> > wrote:
> > 
> >> Thanks Chris for the feedback
> >>> 1. It'd be nice to mention that increasing the max offset lag to INT_MAX
> >>> could work as a partial workaround for users on existing versions (though
> >>> of course this wouldn't prevent creation of the syncs topic).
> >> I updated the KIP
> >> 
> >>> 2. Will it be illegal to disable offset syncs if other features that rely
> >>> on offset syncs are explicitly enabled in the connector config? If
> >> they're
> >>> not explicitly enabled then it should probably be fine to silently
> >> disable
> >>> them, but I'd be interested in your thoughts.
> >> The rest of the features that relays on this is controlled by
> >> emit.checkpoints.enabled (enabled by default) and
> >> sync.group.offsets.enabled (disabled by default) which are part of
> >> MirrorCheckpointConnector config not MirrorSourceConnector, I was thinking
> >> that MirrorCheckpointConnector should fail to start if
> >> emit.offset-syncs.enabled is disabled while emit.checkpoints.enabled and/or
> >> sync.group.offsets.enabled are enabled as no point of creating this
> >> connector if the main part is disabled. WDYT?
> >> 
> >> Thanks
> >> Omnia
> >> 
> >>> On 3 Apr 2024, at 12:45, Chris Egerton  wrote:
> >>> 
> >>> Hi Omnia,
> >>> 
> >>> Thanks for the KIP! Two small things come to mind:
> >>> 
> >>> 1. It'd be nice to mention that increasing the max offset lag to INT_MAX
> >>> could work as a partial workaround for users on existing versions (though
> >>> of course this wouldn't prevent creation of the syncs topic).
> >>> 
> >>> 2. Will it be illegal to disable offset syncs if other features that rely
> >>> on offset syncs are explicitly enabled in the connector config? If
> >> they're
> >>> not explicitly enabled then it should probably be fine to silently
> >> disable
> >>> them, but I'd be interested in your thoughts.
> >>> 
> >>> Cheers,
> >>> 
> >>> Chris
> >>> 
> >>> On Wed, Apr 3, 2024, 20:41 Luke Chen  wrote:
> >>> 
>  Hi Omnia,
>  
>  Thanks for the KIP!
>  It LGTM!
>  But I'm not an expert of MM2, it would be good to see if there is any
> >> other
>  comment from MM2 experts.
>  
>  Thanks.
>  Luke
>  
>  On Thu, Mar 14, 2024 at 6:08 PM Omnia Ibrahim 
>  wrote:
>  
> > Hi everyone, I would like to start a discussion thread for KIP-1031:
> > Control offset translation in MirrorSourceConnector
> > 
> > 
>  
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1031%3A+Control+offset+translation+in+MirrorSourceConnector
> > 
> > Thanks
> > Omnia
> > 
>  
> >> 
> >> 
> 
> 


[jira] [Resolved] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6

2024-05-20 Thread Igor Soarez (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Soarez resolved KAFKA-16692.
-
Resolution: Fixed

> InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not 
> enabled when upgrading from kafka 3.5 to 3.6 
> 
>
> Key: KAFKA-16692
> URL: https://issues.apache.org/jira/browse/KAFKA-16692
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 3.7.0, 3.6.1, 3.8
>Reporter: Johnson Okorie
>Assignee: Justine Olshan
>Priority: Major
> Fix For: 3.7.1, 3.8
>
>
> We have a kafka cluster running on version 3.5.2 that we are upgrading to 
> 3.6.1. This cluster has a lot of clients with exactly-once semantics enabled 
> and hence creating transactions. As we replaced brokers with the new 
> binaries, we observed lots of clients in the cluster experiencing the 
> following error:
> {code:java}
> 2024-05-07T09:08:10.039Z "tid": "" -- [Producer clientId=, 
> transactionalId=] Got error produce response with 
> correlation id 6402937 on topic-partition , retrying 
> (2147483512 attempts left). Error: NETWORK_EXCEPTION. Error Message: The 
> server disconnected before a response was received.{code}
> On inspecting the broker, we saw the following errors on brokers still 
> running Kafka version 3.5.2:
>  
> {code:java}
> message:     
> Closing socket for  because of error
> exception_exception_class:    
> org.apache.kafka.common.errors.InvalidRequestException
> exception_exception_message:    
> Received request api key ADD_PARTITIONS_TO_TXN with version 4 which is not 
> enabled
> exception_stacktrace:    
> org.apache.kafka.common.errors.InvalidRequestException: Received request api 
> key ADD_PARTITIONS_TO_TXN with version 4 which is not enabled
> {code}
> On the new brokers running 3.6.1 we saw the following errors:
>  
> {code:java}
> [AddPartitionsToTxnSenderThread-1055]: AddPartitionsToTxnRequest failed for 
> node 1043 with a network exception.{code}
>  
> I can also see this :
> {code:java}
> [AddPartitionsToTxnManager broker=1055]Cancelled in-flight 
> ADD_PARTITIONS_TO_TXN request with correlation id 21120 due to node 1043 
> being disconnected (elapsed time since creation: 11ms, elapsed time since 
> send: 4ms, request timeout: 3ms){code}
> While investigating this issue and digging through the changes in 3.6, 
> we came across some changes introduced as part of KAFKA-14402 that we thought 
> might lead to this behaviour. 
> First we could see that _transaction.partition.verification.enable_ is 
> enabled by default and enables a new code path that culminates in sending 
> version 4 ADD_PARTITIONS_TO_TXN requests to other brokers that are generated 
> [here|https://github.com/apache/kafka/blob/29f3260a9c07e654a28620aeb93a778622a5233d/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L269].
> From a 
> [discussion|https://lists.apache.org/thread/4895wrd1z92kjb708zck4s1f62xq6r8x] 
> on the mailing list, [~jolshan] pointed out that this scenario shouldn't be 
> possible as the following code paths should prevent version 4 
> ADD_PARTITIONS_TO_TXN requests being sent to other brokers:
> [https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L130]
>  
> [https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L195]
> However, these requests are still sent to other brokers in our environment.
> On further inspection of the code, I am wondering if the following code path 
> could lead to this issue:
> [https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L500]
> In this scenario, we don't have any _NodeApiVersions_ available for the 
> specified nodeId and potentially skipping the _latestUsableVersion_ check. I 
> am wondering if it is possible that because _discoverBrokerVersions_ is set 
> to _false_ for the network client of the {_}AddPartitionsToTxnManager{_}, it 
> skips fetching {_}NodeApiVersions{_}? I can see that we create the network 
> client here:
> [https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/core/src/main/scala/kafka/server/KafkaServer.scala#L641]
> The _NetworkUtils.buildNetworkClient_ method seems to create a network client 
> that has _discoverBrokerVersions_ set to {_}false{_}. 
> I was hoping I could get some assistance debugging this issue. Happy to 
> provide any additional information needed.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16645) CVEs in 3.7.0 docker image

2024-05-20 Thread Igor Soarez (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Soarez resolved KAFKA-16645.
-
  Assignee: Igor Soarez
Resolution: Won't Fix

The vulnerability has already been addressed in the base image, under the same 
image tag, so the next published Kafka images will not ship the 
vulnerability.

We do not republish previous releases, so we're not taking any action here.

> CVEs in 3.7.0 docker image
> --
>
> Key: KAFKA-16645
> URL: https://issues.apache.org/jira/browse/KAFKA-16645
> Project: Kafka
>  Issue Type: Task
>Affects Versions: 3.7.0
>Reporter: Mickael Maison
>Assignee: Igor Soarez
>Priority: Blocker
> Fix For: 3.8.0, 3.7.1
>
>
> Our [Docker Image CVE 
> Scanner|https://github.com/apache/kafka/actions/runs/874393] GitHub 
> action reports 2 high CVEs in our base image:
> apache/kafka:3.7.0 (alpine 3.19.1)
> ==
> Total: 2 (HIGH: 2, CRITICAL: 0)
> Library: libexpat | Vulnerability: CVE-2023-52425 | Severity: HIGH | Status: fixed | Installed Version: 2.5.0-r2 | Fixed Version: 2.6.0-r0 | Title: expat: parsing large tokens can trigger a denial of service (https://avd.aquasec.com/nvd/cve-2023-52425)
> Library: libexpat | Vulnerability: CVE-2024-28757 | Severity: HIGH | Status: fixed | Installed Version: 2.5.0-r2 | Fixed Version: 2.6.2-r0 | Title: expat: XML Entity Expansion (https://avd.aquasec.com/nvd/cve-2024-28757)
> Looking at the 
> [KIP|https://cwiki.apache.org/confluence/display/KAFKA/KIP-975%3A+Docker+Image+for+Apache+Kafka#KIP975:DockerImageforApacheKafka-WhatifweobserveabugoracriticalCVEinthereleasedApacheKafkaDockerImage?]
>  that introduced the docker images, it seems we should release a bugfix when 
> high CVEs are detected. It would be good to investigate and assess whether 
> Kafka is impacted or not.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] MINOR - Move CSP header from _header.htm to .htaccess [kafka-site]

2024-05-20 Thread via GitHub


brandboat commented on PR #602:
URL: https://github.com/apache/kafka-site/pull/602#issuecomment-2120632215

   Gentle ping @raboof, would you mind taking a look? I saw you left a comment 
in https://github.com/apache/kafka-site/pull/597#issuecomment-2075421541, and 
this PR seems to address what you raised there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [VOTE] KIP-1040: Improve handling of nullable values in InsertField, ExtractField, and other transformations

2024-05-20 Thread Chris Egerton
Thanks for the KIP! +1 (binding)

On Mon, May 20, 2024 at 4:22 AM Mario Fiore Vitale 
wrote:

> Hi everyone,
>
> I'd like to call a vote on KIP-1040 which aims to improve handling of
> nullable values in InsertField, ExtractField, and other transformations
>
> KIP -
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=303794677
>
> Discussion thread -
> https://lists.apache.org/thread/ggqqqjbg6ccpz8g6ztyj7oxr80q5184n
>
> Thanks and regards,
> Mario
>


[jira] [Resolved] (KAFKA-16603) Data loss when kafka connect sending data to Kafka

2024-05-20 Thread Chris Egerton (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Egerton resolved KAFKA-16603.
---
Resolution: Not A Bug

> Data loss when kafka connect sending data to Kafka
> --
>
> Key: KAFKA-16603
> URL: https://issues.apache.org/jira/browse/KAFKA-16603
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, producer 
>Affects Versions: 3.3.1
>Reporter: Anil Dasari
>Priority: Major
>
> We are experiencing data loss when the Kafka source connector fails to send 
> data to the Kafka topic and the offset topic. 
> Kafka cluster and Kafka connect details:
>  # Kafka connect version i.e client : Confluent community version 7.3.1 i.e 
> Kafka 3.3.1
>  # Kafka version: 0.11.0 (server)
>  # Cluster size : 3 brokers
>  # Number of partitions in all topics = 3
>  # Replication factor = 3
>  # Min ISR set 2
>  # Uses no transformations in Kafka connector
>  # Use default error tolerance i.e None.
> Our connector checkpoints the offsets info received in 
> SourceTask#commitRecord and resumes data processing from the persisted 
> checkpoint.
> The data loss is noticed when the broker is unresponsive for a few minutes due 
> to high load and the Kafka connector is restarted. Also, the Kafka connector's 
> graceful shutdown failed.
> Logs:
>  
> {code:java}
> [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Discovered group 
> coordinator 10.75.100.176:31000 (id: 2147483647 rack: null)
> Apr 22, 2024 @ 15:56:16.152 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Group coordinator 
> 10.75.100.176:31000 (id: 2147483647 rack: null) is unavailable or invalid due 
> to cause: coordinator unavailable. isDisconnected: false. Rediscovery will be 
> attempted.
> Apr 22, 2024 @ 15:56:16.153 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Requesting disconnect from 
> last known coordinator 10.75.100.176:31000 (id: 2147483647 rack: null)
> Apr 22, 2024 @ 15:56:16.514 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Node 0 disconnected.
> Apr 22, 2024 @ 15:56:16.708 [Producer 
> clientId=connector-producer-d094a5d7bbb046b99d62398cb84d648c-0] Node 0 
> disconnected.
> Apr 22, 2024 @ 15:56:16.710 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Node 2147483647 
> disconnected.
> Apr 22, 2024 @ 15:56:16.731 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Group coordinator 
> 10.75.100.176:31000 (id: 2147483647 rack: null) is unavailable or invalid due 
> to cause: coordinator unavailable. isDisconnected: true. Rediscovery will be 
> attempted.
> Apr 22, 2024 @ 15:56:19.103 == Trying to sleep while stop == (** custom log 
> **)
> Apr 22, 2024 @ 15:56:19.755 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Broker coordinator was 
> unreachable for 3000ms. Revoking previous assignment Assignment{error=0, 
> leader='connect-1-8f41a1d2-6cc9-4956-9be3-1fbae9c6d305', 
> leaderUrl='http://10.75.100.46:8083/', offset=4, 
> connectorIds=[d094a5d7bbb046b99d62398cb84d648c], 
> taskIds=[d094a5d7bbb046b99d62398cb84d648c-0], revokedConnectorIds=[], 
> revokedTaskIds=[], delay=0} to avoid running tasks while not being a member 
> the group
> Apr 22, 2024 @ 15:56:19.866 Stopping connector 
> d094a5d7bbb046b99d62398cb84d648c
> Apr 22, 2024 @ 15:56:19.874 Stopping task d094a5d7bbb046b99d62398cb84d648c-0
> Apr 22, 2024 @ 15:56:19.880 Scheduled shutdown for 
> WorkerConnectorWorkerConnector{id=d094a5d7bbb046b99d62398cb84d648c}
> Apr 22, 2024 @ 15:56:24.105 Connector 'd094a5d7bbb046b99d62398cb84d648c' 
> failed to properly shut down, has become unresponsive, and may be consuming 
> external resources. Correct the configuration for this connector or remove 
> the connector. After fixing the connector, it may be necessary to restart 
> this worker to release any consumed resources.
> Apr 22, 2024 @ 15:56:24.110 [Producer 
> clientId=connector-producer-d094a5d7bbb046b99d62398cb84d648c-0] Closing the 
> Kafka producer with timeoutMillis = 0 ms.
> Apr 22, 2024 @ 15:56:24.110 [Producer 
> clientId=connector-producer-d094a5d7bbb046b99d62398cb84d648c-0] Proceeding to 
> force close the producer since pending requests could not be completed within 
> timeout 0 ms.
> Apr 22, 2024 @ 15:56:24.112 [Producer 
> clientId=connector-producer-d094a5d7bbb046b99d62398cb84d648c-0] Beginning 
> shutdown of Kafka producer I/O thread, sending remaining records.
> Apr 22, 2024 @ 15:56:24.112 [Producer 
> clientId=connector-producer-d094a5d7bbb046b99d62398cb84d648c-0] Aborting 
> incomplete batches due to forced shutdown
> Apr 22, 2024 @ 15:56:24.113 
> WorkerSourceTaskWorkerSourceTask{id=d094a5d7bbb046b99d62398cb84d648c-0} 
> 

[jira] [Resolved] (KAFKA-16656) Using a custom replication.policy.separator with DefaultReplicationPolicy

2024-05-20 Thread Chris Egerton (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Egerton resolved KAFKA-16656.
---
Resolution: Not A Bug

> Using a custom replication.policy.separator with DefaultReplicationPolicy
> -
>
> Key: KAFKA-16656
> URL: https://issues.apache.org/jira/browse/KAFKA-16656
> Project: Kafka
>  Issue Type: Bug
>  Components: mirrormaker
>Affects Versions: 3.5.1
>Reporter: Lenin Joseph
>Priority: Major
>
> Hi,
> In the case of bidirectional replication using MM2, when we tried using a 
> custom replication.policy.separator (e.g. "-") with DefaultReplicationPolicy, 
> we saw cyclic replication of topics. Could you confirm whether it's mandatory 
> to use a CustomReplicationPolicy whenever we want to use a separator other 
> than a "."?
> Regards, 
> Lenin



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] KIP-1038: Add Custom Error Handler to Producer

2024-05-20 Thread Walker Carlson
Hey Alieh,

Thanks for the KIP.

+1 binding

Walker

On Tue, May 7, 2024 at 10:57 AM Alieh Saeedi 
wrote:

> Hi all,
>
> It seems that we have no more comments, discussions, or feedback on
> KIP-1038; therefore, I’d like to open voting for the KIP: Add Custom Error
> Handler to Producer
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1038%3A+Add+Custom+Error+Handler+to+Producer
> >
>
>
> Cheers,
> Alieh
>


Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #2919

2024-05-20 Thread Apache Jenkins Server
See 




Re: [VOTE] KIP-950: Tiered Storage Disablement

2024-05-20 Thread Satish Duggana
+1
Thanks Christo for addressing the review comments. We can update the
KIP for any minor comments/clarifications.


On Thu, 16 May 2024 at 15:21, Luke Chen  wrote:
>
> Thanks Chia-Ping!
> Since ZK is going to be removed, I agree the KRaft part has higher priority.
> But if Christo or the community contributor has spare time, it's good to
> have ZK part, too!
>
> Thanks.
> Luke
>
> On Thu, May 16, 2024 at 5:45 PM Chia-Ping Tsai  wrote:
>
> > +1 but I prefer to ship it to KRaft only.
> >
> > I am concerned about whether the community has enough time to accept more features in 3.8
> > :(
> >
> > Best,
> > Chia-Ping
> >
> > On 2024/05/14 15:20:50 Christo Lolov wrote:
> > > Heya!
> > >
> > > I would like to start a vote on KIP-950: Tiered Storage Disablement in
> > > order to catch the last Kafka release targeting Zookeeper -
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-950%3A++Tiered+Storage+Disablement
> > >
> > > Best,
> > > Christo
> > >
> >


[jira] [Created] (KAFKA-16798) Mirrormaker2 consumer groups sync.group.offsets.interval not working

2024-05-20 Thread Thanos Athanasopoulos (Jira)
Thanos Athanasopoulos created KAFKA-16798:
-

 Summary: Mirrormaker2 consumer groups sync.group.offsets.interval 
not working
 Key: KAFKA-16798
 URL: https://issues.apache.org/jira/browse/KAFKA-16798
 Project: Kafka
  Issue Type: Bug
  Components: mirrormaker
Affects Versions: 3.7.0
Reporter: Thanos Athanasopoulos


Single-instance MirrorMaker2 in dedicated mode with active/passive replication 
logic.

sync.group.offsets.interval.seconds=2 configuration is enabled and active


{noformat}
[root@x-x ~]# docker logs cc-mm 2>&1 -f | grep -i 
"auto.commit.interval\|checkpoint.interval\|consumer.commit.interval\|sync.topics.interval\|sync.group.offsets.interval\|offset-syncs.interval.seconds
                                 "
        auto.commit.interval.ms = 5000
        auto.commit.interval.ms = 5000
        auto.commit.interval.ms = 5000
        auto.commit.interval.ms = 5000
        auto.commit.interval.ms = 5000
        auto.commit.interval.ms = 5000
        sync.group.offsets.interval.seconds = 2
        sync.group.offsets.interval.seconds = 2
        auto.commit.interval.ms = 5000
        sync.group.offsets.interval.seconds = 2
        auto.commit.interval.ms = 5000
        auto.commit.interval.ms = 5000
        auto.commit.interval.ms = 5000
        auto.commit.interval.ms = 5000
        auto.commit.interval.ms = 5000
        auto.commit.interval.ms = 5000
        auto.commit.interval.ms = 5000
        auto.commit.interval.ms = 5000
        sync.group.offsets.interval.seconds = 2
        auto.commit.interval.ms = 5000
        auto.commit.interval.ms = 5000
        sync.group.offsets.interval.seconds = 2
        sync.group.offsets.interval.seconds = 2
        auto.commit.interval.ms = 5000
{noformat}


but it is not working; the offsets are always committed every 60 seconds, as you 
can see in the logs:


{noformat}
[2024-05-20 09:32:44,847] INFO [MirrorSourceConnector|task-0|offsets] 
WorkerSourceTask{id=MirrorSourceConnector-0} Committing offsets for 20 
acknowledged messages (org.apache.kafka.connect.runtime.WorkerSourceTask:233)
[2024-05-20 09:32:44,852] INFO [MirrorSourceConnector|task-1|offsets] 
WorkerSourceTask{id=MirrorSourceConnector-1} Committing offsets for 12 
acknowledged messages (org.apache.kafka.connect.runtime.WorkerSourceTask:233)
[2024-05-20 09:32:44,875] INFO [MirrorHeartbeatConnector|task-0|offsets] 
WorkerSourceTask{id=MirrorHeartbeatConnector-0} Committing offsets for 12 
acknowledged messages (org.apache.kafka.connect.runtime.WorkerSourceTask:233)
[2024-05-20 09:32:44,881] INFO [MirrorCheckpointConnector|task-0|offsets] 
WorkerSourceTask{id=MirrorCheckpointConnector-0} Committing offsets for 1 
acknowledged messages (org.apache.kafka.connect.runtime.WorkerSourceTask:233)
[2024-05-20 09:32:44,890] INFO [MirrorHeartbeatConnector|task-0|offsets] 
WorkerSourceTask{id=MirrorHeartbeatConnector-0} Committing offsets for 12 
acknowledged messages (org.apache.kafka.connect.runtime.WorkerSourceTask:233)
[2024-05-20 09:33:44,850] INFO [MirrorSourceConnector|task-0|offsets] 
WorkerSourceTask{id=MirrorSourceConnector-0} Committing offsets for 21 
acknowledged messages (org.apache.kafka.connect.runtime.WorkerSourceTask:233)
[2024-05-20 09:33:44,854] INFO [MirrorSourceConnector|task-1|offsets] 
WorkerSourceTask{id=MirrorSourceConnector-1} Committing offsets for 12 
acknowledged messages (org.apache.kafka.connect.runtime.WorkerSourceTask:233)
[2024-05-20 09:33:44,878] INFO [MirrorHeartbeatConnector|task-0|offsets] 
WorkerSourceTask{id=MirrorHeartbeatConnector-0} Committing offsets for 12 
acknowledged messages (org.apache.kafka.connect.runtime.WorkerSourceTask:233)
[2024-05-20 09:33:44,883] INFO [MirrorCheckpointConnector|task-0|offsets] 
WorkerSourceTask{id=MirrorCheckpointConnector-0} Committing offsets for 2 
acknowledged messages (org.apache.kafka.connect.runtime.WorkerSourceTask:233)
[2024-05-20 09:33:44,895] INFO [MirrorHeartbeatConnector|task-0|offsets] 
WorkerSourceTask{id=MirrorHeartbeatConnector-0} Committing offsets for 12 
acknowledged messages (org.apache.kafka.connect.runtime.WorkerSourceTask:233)
[2024-05-20 09:34:44,852] INFO [MirrorSourceConnector|task-0|offsets] 
WorkerSourceTask{id=MirrorSourceConnector-0} Committing offsets for 20 
acknowledged messages (org.apache.kafka.connect.runtime.WorkerSourceTask:233)
[2024-05-20 09:34:44,857] INFO [MirrorSourceConnector|task-1|offsets] 
WorkerSourceTask{id=MirrorSourceConnector-1} Committing offsets for 12 
acknowledged messages (org.apache.kafka.connect.runtime.WorkerSourceTask:233)
[2024-05-20 09:34:44,880] INFO [MirrorHeartbeatConnector|task-0|offsets] 
WorkerSourceTask{id=MirrorHeartbeatConnector-0} Committing offsets for 12 
acknowledged messages (org.apache.kafka.connect.runtime.WorkerSourceTask:233)
[2024-05-20 09:34:44,886] INFO [MirrorCheckpointConnector|task-0|offsets] 

[jira] [Resolved] (KAFKA-16797) A bit cleanup of FeatureControlManager

2024-05-20 Thread Chia-Ping Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chia-Ping Tsai resolved KAFKA-16797.

Fix Version/s: 3.8.0
   Resolution: Fixed

> A bit cleanup of FeatureControlManager
> --
>
> Key: KAFKA-16797
> URL: https://issues.apache.org/jira/browse/KAFKA-16797
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Chia-Ping Tsai
>Assignee: 黃竣陽
>Priority: Trivial
> Fix For: 3.8.0
>
>
> [https://github.com/apache/kafka/blob/trunk/metadata/src/main/java/org/apache/kafka/controller/FeatureControlManager.java#L62]
>  
> that can be replaced by `Collections.emptyIterator()`
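
For reference, the suggested cleanup just swaps a hand-rolled empty iterator for the JDK helper. The "before" shape below is a hypothetical placeholder rather than the actual FeatureControlManager code:

{code:java}
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;

class EmptyIteratorExample {
    // Before: building an empty iterator indirectly (placeholder for the pattern).
    static Iterator<Map.Entry<String, Short>> before() {
        return Collections.<String, Short>emptyMap().entrySet().iterator();
    }

    // After: the simpler equivalent suggested in the ticket.
    static Iterator<Map.Entry<String, Short>> after() {
        return Collections.emptyIterator();
    }
}
{code}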



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[VOTE] KIP-1040: Improve handling of nullable values in InsertField, ExtractField, and other transformations

2024-05-20 Thread Mario Fiore Vitale
Hi everyone,

I'd like to call a vote on KIP-1040 which aims to improve handling of
nullable values in InsertField, ExtractField, and other transformations

KIP -
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=303794677

Discussion thread -
https://lists.apache.org/thread/ggqqqjbg6ccpz8g6ztyj7oxr80q5184n

Thanks and regards,
Mario