[jira] [Created] (KAFKA-14550) Move SnapshotFile and CorruptSnapshotException to storage module

2022-12-22 Thread Satish Duggana (Jira)
Satish Duggana created KAFKA-14550:
--

 Summary: Move SnapshotFile and CorruptSnapshotException to storage module
 Key: KAFKA-14550
 URL: https://issues.apache.org/jira/browse/KAFKA-14550
 Project: Kafka
  Issue Type: Sub-task
  Components: core
Reporter: Satish Duggana
Assignee: Satish Duggana






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [ANNOUNCE] New committer: Josep Prat

2022-12-22 Thread Satish Duggana
Congratulations Josep!

On Wed, 21 Dec 2022 at 22:38, Yash Mayya  wrote:
>
> Congratulations Josep!
>
> On Tue, Dec 20, 2022, 22:56 Jun Rao  wrote:
>
> > Hi, Everyone,
> >
> > The PMC of Apache Kafka is pleased to announce a new Kafka committer, Josep Prat.
> >
> > Josep has been contributing to Kafka since May 2021. He contributed 20 PRs
> > including the following 2 KIPs.
> >
> > KIP-773: Differentiate metric latency measured in ms and ns
> > KIP-744: Migrate TaskMetadata and ThreadMetadata to an interface with
> > internal implementation
> >
> > Congratulations, Josep!
> >
> > Thanks,
> >
> > Jun (on behalf of the Apache Kafka PMC)
> >


Re: [VOTE] KIP-837 Allow MultiCasting a Result Record.

2022-12-22 Thread Matthias J. Sax

Thanks.

Glad we could align.


-Matthias

On 12/21/22 2:09 AM, Sagar wrote:

Hi All,

Just as an update, the changes described here:

```
Changes at a high level are:

1) KeyQueryMetadata enhanced to have a new method called partitions().
2) Lifting the restriction of a single partition for IQ. Now the
restriction holds only for FK Join.
```

have been reverted. As things stand, KeyQueryMetadata exposes only the
partition() method and the single-partition restriction is back in place
for IQ. This has been done based on the points raised by Matthias above.

The KIP has been updated accordingly.

Thanks!
Sagar.
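
For anyone skimming the thread, here is a minimal sketch of the multicast
feature itself: a StreamPartitioner whose partitions() method (the method
KIP-837 adds) fans a record out to several partitions. The class name and
the chosen partitions are illustrative assumptions, not code from the KIP
or the PR:

```java
import java.util.Optional;
import java.util.Set;
import org.apache.kafka.streams.processor.StreamPartitioner;

// Illustrative multicast partitioner: every record is sent to partitions 0
// and 1 (this assumes the target topic has at least two partitions).
public class BroadcastPartitioner<K, V> implements StreamPartitioner<K, V> {

    @Override
    public Integer partition(final String topic, final K key, final V value,
                             final int numPartitions) {
        // Single-partition fallback; the multicast path below takes precedence
        // when partitions() returns a non-empty Optional.
        return 0;
    }

    @Override
    public Optional<Set<Integer>> partitions(final String topic, final K key,
                                             final V value, final int numPartitions) {
        // Multicast: fan the record out to two partitions.
        return Optional.of(Set.of(0, 1));
    }
}
```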

On Sat, Dec 10, 2022 at 12:09 AM Sagar  wrote:


Hey Matthias,

Actually, I had shared the PR link so that any potential issues that might
have been missed could be flagged. I guess it didn't come across that way
in my response. Apologies for that!

I am more than happy to incorporate any feedback/changes or address any
concerns that are still present around this at this point as well.

And I would keep in mind the feedback to provide more time in such a
scenario.

Thanks!
Sagar.

On Fri, Dec 9, 2022 at 11:41 PM Matthias J. Sax  wrote:


It is what it is.


we did have internal discussions on this


We sometimes have the case that a KIP needs adjustment as stuff is
discovered during coding. And having a discussion on the PR about it is
fine. -- However, before the PR gets merged, the KIP change should be
announced to verify that nobody has objections to the change, before we
carry forward.

It's up to the committer who reviews/merges the PR to ensure that this
process is followed IMHO. I hope we can do better next time.

(And yes, there was the 3.4 release KIP deadline that might explain it,
but it seems important that we give enough time to make "tricky" changes
and not rush into stuff IMHO.)


-Matthias


On 12/8/22 7:04 PM, Sagar wrote:

Thanks Matthias,

Well, as things stand, we did have internal discussions on this, and it
seemed ok to open it up for IQ but not ok to open it up for FK-Join. More
importantly, the PR for this is already merged and some of these things
came up during that review. Here's the PR link:
https://github.com/apache/kafka/pull/12803.

Thanks!
Sagar.


On Fri, Dec 9, 2022 at 5:15 AM Matthias J. Sax wrote:



Ah. Missed it as it does not have a nice "code block" similar to
`StreamPartitioner` changes.

I understand the motivation, but I am wondering if we might head in a
tricky direction? State stores (at least the built-in ones) and IQ are
kinda built with the idea to have sharded data and that a multi-cast of
keys is an anti-pattern?

Maybe it's fine, but I also don't want to open Pandora's Box. Are we
sure that generalizing the concepts does not cause issues in the future?

Ie, should we claim that the multi-cast feature should be used for
KStreams only, but not for KTables?

Just want to double check that we are not doing something we regret later.



-Matthias


On 12/7/22 6:45 PM, Sagar wrote:

Hi Matthias,

I did save it. The changes are added under Public Interfaces (Pt#2 about
enhancing KeyQueryMetadata with a partitions() method) and throwing an
IllegalArgumentException when the StreamPartitioner#partitions method
returns multiple partitions for just FK-Join, instead of the earlier
decided FK-Join and IQ.

The background is that for IQ, if the users have multicasted records to
multiple partitions during ingestion but the fetch returns only a single
partition, then it would be wrong. That's why the restriction was lifted
for IQ and that's the reason KeyQueryMetadata now has another partitions()
method to signify the same.

FK-Join also has a similar case, but while reviewing it was felt that
FK-Join on its own is fairly complicated and we don't need this feature
right away, so the restriction still exists.

Thanks!
Sagar.
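
To make the FK-Join restriction above concrete, here is a rough sketch of
the kind of validation being described; the helper name and the fallback
value are hypothetical, and this is not the code from the merged PR:

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical helper illustrating the restriction: foreign-key joins reject
// partitioners that multicast, because the FK-Join internals assume a single
// target partition per record.
final class FkJoinPartitionCheck {

    private FkJoinPartitionCheck() { }

    static int singlePartitionOrThrow(final Optional<Set<Integer>> partitions) {
        if (!partitions.isPresent() || partitions.get().isEmpty()) {
            return -1; // placeholder: caller falls back to default partitioning
        }
        if (partitions.get().size() > 1) {
            throw new IllegalArgumentException(
                "StreamPartitioner#partitions must not return multiple partitions "
                    + "for a foreign-key join");
        }
        return partitions.get().iterator().next();
    }
}
```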


On Wed, Dec 7, 2022 at 9:42 PM Matthias J. Sax wrote:



I don't see any update on the wiki about it. Did you forget to hit "save"?


Can you also provide some background? I am not sure right now if I
understand the proposed changes?


-Matthias

On 12/6/22 6:36 PM, Sophie Blee-Goldman wrote:

Thanks Sagar, this makes sense to me -- we clearly need additional changes
to avoid breaking IQ when using this feature, but I agree with continuing
to restrict FKJ since they wouldn't stop working without it, and would
become much harder to reason about (than they already are) if we did
enable them to use it.

And of course, they can still multicast the final results of a FKJ, they
just can't mess with the internal workings of it in this way.

On Tue, Dec 6, 2022 at 9:48 AM Sagar wrote:



Hi All,

I made a couple of edits to the KIP which came up during the code review.

Changes at a high level are:

1) KeyQueryMetadata enhanced to have a new method called partitions().
2) Lifting the restriction of a single partition for IQ. Now the
restriction holds only for FK Join.

Updated KIP:







[jira] [Created] (KAFKA-14549) Move LogDirFailureChannel to storage module

2022-12-22 Thread Federico Valeri (Jira)
Federico Valeri created KAFKA-14549:
---

 Summary: Move LogDirFailureChannel to storage module
 Key: KAFKA-14549
 URL: https://issues.apache.org/jira/browse/KAFKA-14549
 Project: Kafka
  Issue Type: Sub-task
Reporter: Federico Valeri






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2022-12-22 Thread Nick Telford
Hi everyone,

I've updated the KIP with a more detailed design, which reflects the
implementation I've been working on:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-892%3A+Transactional+Semantics+for+StateStores

This new design should address the outstanding points already made in the
thread.

Please let me know if there are areas that are unclear or need more
clarification.

I have a (nearly) working implementation. I'm confident that the remaining
work (making Segments behave) will not impact the documented design.

Regards,

Nick

On Tue, 6 Dec 2022 at 19:24, Colt McNealy  wrote:

> Nick,
>
> Thank you for the reply; that makes sense. I was hoping that, since reading
> uncommitted records from IQ in EOS isn't part of the documented API, maybe
> you *wouldn't* have to wait for the next major release to make that change;
> but given that it would be considered a major change, I like your approach
> the best.
>
> Wishing you a speedy recovery and happy coding!
>
> Thanks,
> Colt McNealy
> *Founder, LittleHorse.io*
>
>
> On Tue, Dec 6, 2022 at 10:30 AM Nick Telford 
> wrote:
>
> > Hi Colt,
> >
> > 10: Yes, I agree it's not ideal. I originally intended to try to keep the
> > behaviour unchanged as much as possible, otherwise we'd have to wait for
> a
> > major version release to land these changes.
> > 20: Good point, ALOS doesn't need the same level of guarantee, and the
> > typically longer commit intervals would be problematic when reading only
> > "committed" records.
> >
> > I've been away for 5 days recovering from minor surgery, but I spent a
> > considerable amount of that time working through ideas for possible
> > solutions in my head. I think your suggestion of keeping ALOS as-is, but
> > buffering writes for EOS is the right path forwards, although I have a
> > solution that both expands on this, and provides for some more formal
> > guarantees.
> >
> > Essentially, adding support to KeyValueStores for "Transactions", with
> > clearly defined IsolationLevels. Using "Read Committed" when under EOS,
> and
> > "Read Uncommitted" under ALOS.
> >
> > The nice thing about this approach is that it gives us much more clearly
> > defined isolation behaviour that can be properly documented to ensure
> users
> > know what to expect.
> >
> > I'm still working out the kinks in the design, and will update the KIP
> when
> > I have something. The main struggle is trying to implement this without
> > making any major changes to the existing interfaces or breaking existing
> > implementations, because currently everything expects to operate directly
> > on a StateStore, and not a Transaction of that store. I think I'm getting
> > close, although sadly I won't be able to progress much until next week
> due
> > to some work commitments.
> >
> > Regards,
> > Nick
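
As a reading aid, a minimal sketch of the transactional-store idea Nick
describes above; every name here (TransactionalKeyValueStore, Transaction,
IsolationLevel) is a hypothetical illustration, not the interface proposed
in KIP-892:

```java
// Hypothetical sketch of a state store supporting transactions with explicit
// isolation levels: "Read Committed" under EOS, "Read Uncommitted" under ALOS.
public interface TransactionalKeyValueStore<K, V> {

    enum IsolationLevel { READ_COMMITTED, READ_UNCOMMITTED }

    /** Begins a transaction whose reads obey the given isolation level. */
    Transaction<K, V> newTransaction(IsolationLevel isolationLevel);

    interface Transaction<K, V> {
        void put(K key, V value); // buffered; invisible to READ_COMMITTED readers
        V get(K key);             // sees this transaction's writes plus committed state
        void commit();            // atomically applies buffered writes to the store
        void abort();             // discards buffered writes
    }
}
```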
> >
> > On Thu, 1 Dec 2022 at 00:01, Colt McNealy  wrote:
> >
> > > Nick,
> > >
> > > Thank you for the explanation, and also for the updated KIP. I am quite
> > > eager for this improvement to be released as it would greatly reduce
> the
> > > operational difficulties of EOS streams apps.
> > >
> > > Two questions:
> > >
> > > 10)
> > > >When reading records, we will use the
> > > WriteBatchWithIndex#getFromBatchAndDB
> > >  and WriteBatchWithIndex#newIteratorWithBase utilities in order to
> ensure
> > > that uncommitted writes are available to query.
> > > Why do extra work to enable the reading of uncommitted writes during
> IQ?
> > > Code complexity aside, reading uncommitted writes is, in my opinion, a
> > > minor flaw in EOS IQ; it would be very nice to have the guarantee that,
> > > with EOS, IQ only reads committed records. In order to avoid dirty
> reads,
> > > one currently must query a standby replica (but this still doesn't
> fully
> > > guarantee monotonic reads).
> > >
> > > 20) Is it also necessary to enable this optimization on ALOS stores?
> The
> > > motivation of KIP-844 was mainly to reduce the need to restore state
> from
> > > scratch on unclean EOS shutdowns; with ALOS it was acceptable to accept
> > > that there may have been uncommitted writes on disk. On a side note, if
> > you
> > > enable this type of store on ALOS processors, the community would
> > > definitely want to enable queries on dirty reads; otherwise users would
> > > have to wait 30 seconds (default) to see an update.
> > >
> > > Thank you for doing this fantastic work!
> > > Colt McNealy
> > > *Founder, LittleHorse.io*
> > >
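
The WriteBatchWithIndex utilities referenced in question 10 above can be
illustrated with a small sketch against the RocksDB Java API; this is only
an illustration of the read-your-uncommitted-writes mechanism, not the
KIP-892 implementation, and the database path is an arbitrary placeholder:

```java
import org.rocksdb.*;

public class WriteBatchWithIndexExample {
    public static void main(final String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/wbwi-example");
             WriteBatchWithIndex batch = new WriteBatchWithIndex(true);
             ReadOptions readOptions = new ReadOptions();
             WriteOptions writeOptions = new WriteOptions()) {

            // An uncommitted write: buffered in the batch, not yet in the DB.
            batch.put("key".getBytes(), "uncommitted-value".getBytes());

            // Reads consult the batch first, then fall back to the DB,
            // so the uncommitted write is visible to the query path.
            final byte[] value = batch.getFromBatchAndDB(db, readOptions, "key".getBytes());
            System.out.println(new String(value)); // prints "uncommitted-value"

            // Range scans merge the batch's contents with the DB's contents.
            try (RocksIterator base = db.newIterator();
                 RocksIterator merged = batch.newIteratorWithBase(base)) {
                for (merged.seekToFirst(); merged.isValid(); merged.next()) {
                    System.out.println(new String(merged.key()));
                }
            }

            // "Commit": atomically apply the buffered writes to the DB.
            db.write(writeOptions, batch);
        }
    }
}
```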
> > >
> > > On Wed, Nov 30, 2022 at 10:44 AM Nick Telford 
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > I've drastically reduced the scope of this KIP to no longer include
> the
> > > > StateStore management of checkpointing. This can be added as a KIP
> > later
> > > on
> > > > to further optimize the consistency and performance of state stores.
> > > >
> > > > I've also added a section discussing some of the concerns around
> > > > concurrency, especially in the presence of Iterators. I'm thinking of
> > > > wrapping 

[jira] [Created] (KAFKA-14548) Stable streams applications stall due to infrequent restoreConsumer polls

2022-12-22 Thread Greg Harris (Jira)
Greg Harris created KAFKA-14548:
---

 Summary: Stable streams applications stall due to infrequent 
restoreConsumer polls
 Key: KAFKA-14548
 URL: https://issues.apache.org/jira/browse/KAFKA-14548
 Project: Kafka
  Issue Type: Bug
  Components: streams
Reporter: Greg Harris


We have observed behavior with Streams where otherwise healthy applications 
stall and become unable to process data after a rebalance. The root cause is 
that a restoreConsumer with stale metadata can become partitioned from the 
Kafka cluster, while the mainConsumer is healthy with up-to-date metadata. 
This is due to both an issue in streams and an issue in the consumer logic.

In StoreChangelogReader, a long-lived restoreConsumer is kept instantiated 
while the streams app is running. This consumer is only `poll()`ed when the 
ChangelogReader::restore method is called and at least one changelog is in the 
RESTORING state. This may be very infrequent if the streams app is stable.

This is an anti-pattern, as frequent poll()s are expected to keep kafka 
consumers in contact with the kafka cluster. Infrequent polls are considered 
failures from the perspective of the consumer API. From the [official Kafka 
Consumer 
documentation|https://kafka.apache.org/33/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html]:
{noformat}
The poll API is designed to ensure consumer liveness.
...
So to stay in the group, you must continue to call poll.
...
The recommended way to handle these cases [where the main thread is not ready 
for more data] is to move message processing to another thread, which allows 
the consumer to continue calling poll while the processor is still working.
...
Note also that you will need to pause the partition so that no new records are 
received from poll until after thread has finished handling those previously 
returned.{noformat}
With the current behavior, it is expected that the restoreConsumer will fall 
out of the group regularly and be considered failed, even when the rest of the 
application is running exactly as intended.

This is not normally an issue, as falling out of the group is easily repaired 
by joining the group during the next poll. It does mean that there is slightly 
higher latency to performing a restore, but that does not appear to be a major 
concern at this time.

This does become an issue when other deeper assumptions about the usage of 
Kafka clients are violated. Relevant to this issue, it is assumed by the client 
metadata management logic that regular polling will take place, and that the 
regular poll call can be piggy-backed to initiate a metadata update. Without a 
regular poll, the regular metadata update cannot be performed, and the consumer 
violates its own `metadata.max.age.ms` configuration. This leads to the 
restoreConsumer holding much older metadata containing none of the currently 
live brokers, partitioning it from the cluster.

Alleviating this failure mode does not _require_ the streams' polling behavior 
to change, as solutions for all clients have been considered 
(https://issues.apache.org/jira/browse/KAFKA-3068 and that family of duplicate 
issues).

However, as a tactical fix for the issue, and one which does not require a KIP 
changing the behavior of _every kafka client_, we should consider changing 
the restoreConsumer poll behavior to bring it closer to the expected happy-path 
of at least one poll() every max.poll.interval.ms.

If there is another hidden assumption of the clients that relies on regular 
polling, then this tactical fix may prevent users of the streams library from 
being affected, reducing the impact of that hidden assumption through 
defense-in-depth.
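
For illustration, a rough sketch of the shape such a keep-alive poll could 
take; this is an assumption about one possible implementation, not the actual 
Streams change. It relies on pausing the restore consumer's assigned partitions 
so the extra poll() cannot return records the restore logic is not ready for:

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.Consumer;

// Sketch: call this periodically (e.g. from the regular restore() pass) even
// when no changelog is RESTORING. The zero-duration poll keeps the consumer in
// regular contact with the brokers, which also drives the piggy-backed metadata
// refresh governed by metadata.max.age.ms.
public final class RestoreConsumerKeepAlive {

    private RestoreConsumerKeepAlive() { }

    public static void keepAlive(final Consumer<byte[], byte[]> restoreConsumer) {
        // Pause everything so the keep-alive poll returns no records.
        restoreConsumer.pause(restoreConsumer.assignment());
        restoreConsumer.poll(Duration.ZERO);
    }
}
```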



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14547) Be able to run kafka KRaft Server in tests without needing to run a storage setup script

2022-12-22 Thread Natan Silnitsky (Jira)
Natan Silnitsky created KAFKA-14547:
---

 Summary: Be able to run kafka KRaft Server in tests without 
needing to run a storage setup script
 Key: KAFKA-14547
 URL: https://issues.apache.org/jira/browse/KAFKA-14547
 Project: Kafka
  Issue Type: Improvement
  Components: kraft
Affects Versions: 3.3.1
Reporter: Natan Silnitsky


Currently the Kafka KRaft server requires running kafka-storage.sh in order to 
start properly.
This makes setup much more cumbersome for build tools like Bazel.

One way to mitigate this is to configure the paths via kafkaConfig...



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] 3.3.2 RC1

2022-12-22 Thread Divij Vaidya
I did the following release validations -

- Verified that `./gradlew test` is successful on an ARM machine
- Verified that we are able to scrape metrics from JMX port
- Verified that, on a cluster running two brokers (3.3.2 and 3.3.1
respectively), produce/consume works for both ZooKeeper and KRaft

Will wait for system test results before adding +1 vote.

*Minor concerns (not release blockers):*

When I try to create a __cluster_metadata topic, it fails with the expected
"Authorization failed" error but prints a weird log line with a date (see the
last line below):

$ kafka-topics.sh --create --replication-factor 1 --partitions 1
--topic __cluster_metadata --bootstrap-server :9092
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/opt/kafka/tools/build/dependant-libs-2.13.8/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/opt/kafka/connect/runtime/build/dependant-libs/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]
WARNING: Due to limitations in metric names, topics with a period ('.') or
underscore ('_') could collide. To avoid issues it is best to use either,
but not both.
Error while executing topic command : Authorization failed.
[2022-12-22 12:45:36,242] ERROR
org.apache.kafka.common.errors.TopicAuthorizationException: Authorization
failed.
 (kafka.admin.TopicCommand$)


--
Divij Vaidya



On Thu, Dec 22, 2022 at 7:12 AM Yash Mayya  wrote:

> Hi Chris,
>
> I did the following release validations -
>
> - Verified the MD5 / SHA-1 / SHA-512 checksums and the PGP signatures
> - Built from source using Java 8 and Scala 2.13
> - Ran all the unit tests successfully
> - Ran all the integration tests successfully (couple of flaky failures that
> passed on a rerun - `TopicCommandIntegrationTest.
> testDeleteInternalTopic(String).quorum=kraft` and
> `SaslScramSslEndToEndAuthorizationTest.
> testNoConsumeWithoutDescribeAclViaSubscribe()`)
> - Quickstart for Kafka and Kafka Connect with both ZooKeeper and KRaft
>
> I'm +1 (non-binding) assuming that the system test results look good.
>
> Thanks,
> Yash
>
> On Thu, Dec 22, 2022 at 3:52 AM Chris Egerton 
> wrote:
>
> > Hello Kafka users, developers and client-developers,
> >
> > This is the second candidate for release of Apache Kafka 3.3.2.
> >
> > This is a bugfix release with several fixes since the release of 3.3.1. A
> > few of the major issues include:
> >
> > * KAFKA-14358 Users should not be able to create a regular topic name
> > __cluster_metadata
> > * KAFKA-14379 Consumer should refresh preferred read replica on update
> > metadata
> > * KAFKA-13586 Prevent exception thrown during connector update from
> > crashing distributed herder
> >
> >
> > Release notes for the 3.3.2 release:
> > https://home.apache.org/~cegerton/kafka-3.3.2-rc1/RELEASE_NOTES.html
> >
> > *** Please download, test and vote by Friday, January 6, 2023, 10pm UTC
> > (this date is chosen to accommodate the various upcoming holidays that
> > members of the community will be taking and give everyone enough time to
> > test out the release candidate, without unduly delaying the release)
> >
> > Kafka's KEYS file containing PGP keys we use to sign the release:
> > https://kafka.apache.org/KEYS
> >
> > * Release artifacts to be voted upon (source and binary):
> > https://home.apache.org/~cegerton/kafka-3.3.2-rc1/
> >
> > * Maven artifacts to be voted upon:
> > https://repository.apache.org/content/groups/staging/org/apache/kafka/
> >
> > * Javadoc:
> > https://home.apache.org/~cegerton/kafka-3.3.2-rc1/javadoc/
> >
> > * Tag to be voted upon (off 3.3 branch) is the 3.3.2 tag:
> > https://github.com/apache/kafka/releases/tag/3.3.2-rc1
> >
> > * Documentation:
> > https://kafka.apache.org/33/documentation.html
> >
> > * Protocol:
> > https://kafka.apache.org/33/protocol.html
> >
> > The most recent build has had test failures. These all appear to be due
> to
> > flakiness, but it would be nice if someone more familiar with the failed
> > tests could confirm this. I may update this thread with passing build
> links
> > if I can get one, or start a new release vote thread if test failures
> must
> > be addressed beyond re-running builds until they pass.
> >
> > Unit/integration tests:
> > https://ci-builds.apache.org/job/Kafka/job/kafka/job/3.3/142/testReport/
> >
> > José, would it be possible to re-run the system tests for 3.3 on the
> latest
> > commit for 3.3 (e3212f2), and share the results on this thread?
> >
> > Cheers,
> >
> > Chris
> >
>