Re: [QUESTION] What is the difference between sequence and offset for a Record?

2023-08-09 Thread tison
Thanks for your reply!

"Normalization" may have been the wrong word. What I was referring to is:

appendInfo.setLastOffset(offset.value - 1)

which, under the hood, updates the base offset field (in the record batch) but
not the offset delta of each record.
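
A minimal sketch of that behavior in Rust may help. The Batch type and its
methods are hypothetical names chosen for illustration, not Kafka's API; the
only assumption is that a batch keeps one base offset in its header and a
relative delta with each record.

```rust
/// Hypothetical, simplified record batch: one absolute base offset in the
/// header plus a relative offset delta stored with each record.
struct Batch {
    base_offset: i64,
    offset_deltas: Vec<i32>,
}

impl Batch {
    /// Absolute offset of every record: base offset + offset delta.
    fn record_offsets(&self) -> Vec<i64> {
        self.offset_deltas
            .iter()
            .map(|&delta| self.base_offset + i64::from(delta))
            .collect()
    }

    /// Rewrite only the header so that the last record lands on
    /// `last_offset`. The per-record deltas are left untouched.
    fn set_last_offset(&mut self, last_offset: i64) {
        let last_delta = self.offset_deltas.last().copied().unwrap_or(0);
        self.base_offset = last_offset - i64::from(last_delta);
    }
}
```

With deltas [0, 1, 2], calling set_last_offset(44) moves the base offset to 42,
so the records resolve to offsets 42, 43 and 44 while the deltas stay as
they were.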

Best,
tison.


On Tue, Aug 8, 2023 at 00:43, Justine Olshan wrote:

> The sequence summary looks right to me.
> For log normalization, are you referring to compaction? The segment's first
> and last offsets might change, but a batch keeps its offsets when
> compaction occurs.
>
> Hope that helps.
> Justine
>
> On Mon, Aug 7, 2023 at 8:59 AM Matthias J. Sax  wrote:
>
> > > but the base offset may change during log normalizing.
> >
> > Not sure what you mean by "normalization" but offsets are immutable, so
> > they don't change. (To be fair, I am not an expert on brokers, so not
> > sure how this work in detail when log compaction ticks in).
> >
> > > This field is given by the producer and the broker should only read it.
> >
> > Sounds right. The point being is, that the broker has an "expected"
> > value for it, and if the provided value does not match the expected one,
> > the write is rejected to begin with.
> >
> >
> > -Matthias
> >
> > On 8/7/23 6:35 AM, tison wrote:
> > > Hi Matthias and Justine,
> > >
> > > Thanks for your reply!
> > >
> > > I can summarize the answer as -
> > >
> > > Record offset = base offset + offset delta. This field is calculated by
> > the
> > > broker and the delta won't change but the base offset may change during
> > log
> > > normalizing.
> > > Record sequence = base sequence + (offset) delta. This field is given
> by
> > > the producer and the broker should only read it.
> > >
> > > Is it correct?
> > >
> > > I implement the manipulation part of base offset following this
> > > understanding at [1].
> > >
> > > Best,
> > > tison.
> > >
> > > [1]
> > >
> >
> https://github.com/tisonkun/kafka-api/blob/d080ab7e4b57c0ab0182e0b254333f400e616cd2/simplesrv/src/lib.rs#L391-L394
> > >
> > >
> > > On Wed, Aug 2, 2023 at 04:19, Justine Olshan wrote:
> > >
> > >> For what it's worth -- the sequence number is not calculated
> > >> "baseOffset/baseSequence + offset delta" but rather by monotonically
> > >> increasing for a given epoch. If the epoch is bumped, we reset back to
> > >> zero.
> > >> This may mean that the offset and sequence may match, but do not
> > strictly
> > >> need to be the same. The sequence number will also always come from
> the
> > >> client and be in the produce records sent to the Kafka broker.
> > >>
> > >> As for offsets, there is some code in the log layer that maintains the
> > log
> > >> end offset and assigns offsets to the records. The produce handling on
> > the
> > >> leader should typically assign the offset.
> > >> I believe you can find that code here:
> > >>
> > >>
> >
> https://github.com/apache/kafka/blob/b9a45546a7918799b6fb3c0fe63b56f47d8fcba9/core/src/main/scala/kafka/log/UnifiedLog.scala#L766
> > >>
> > >> Justine
> > >>
> > >> On Tue, Aug 1, 2023 at 11:38 AM Matthias J. Sax 
> > wrote:
> > >>
> > >>> The _offset_ is the position of the record in the partition.
> > >>>
> > >>> The _sequence number_ is a unique ID that allows broker to
> de-duplicate
> > >>> messages. It requires the producer to implement the idempotency
> > protocol
> > >>> (part of Kafka transactions); thus, sequence numbers are optional and
> > as
> > >>> long as you don't want to support idempotent writes, you don't need
> to
> > >>> worry about them. (If you want to dig into details, checkout KIP-98
> > that
> > >>> is the original KIP about Kafka TX).
> > >>>
> > >>> HTH,
> > >>> -Matthias
> > >>>
> > >>> On 8/1/23 2:19 AM, tison wrote:
> > >>>> Hi,
> > >>>>
> > >>>> I'm writing a Kafka API Rust codec library[1] to understand how Kafka
> > >>>> models its concepts and how the core business logic works.
> > >>>>
> > >>>> While implementing the codec for Records[2], I saw a pair of fields,
> > >>>> "sequence" and "offset". Both of them are calculated as
> > >>>> baseOffset/baseSequence + offset delta, so I'm a bit confused about
> > >>>> how to deal with them properly - what's the difference between these
> > >>>> two concepts logically?
> > >>>>
> > >>>> Also, to understand how the core business logic works, I wrote a
> > >>>> simple server based on my codec library, and observed that the server
> > >>>> may need to update the offset of produced records. How does Kafka set
> > >>>> the correct offset for each produced record? And how does Kafka
> > >>>> maintain the calculation for offset and sequence during these
> > >>>> modifications?
> > >>>>
> > >>>> I'd appreciate it if anyone can answer the question or give some
> > >>>> insights :D
> > >>>>
> > >>>> Best,
> > >>>> tison.
> > >>>>
> > >>>> [1] https://github.com/tisonkun/kafka-api
> > >>>> [2] https://kafka.apache.org/documentation/#messageformat
> > >>>>
> > >>>
> > >>
> > >
> >
>


Re: [QUESTION] What is the difference between sequence and offset for a Record?

2023-08-07 Thread Justine Olshan
The sequence summary looks right to me.
For log normalization, are you referring to compaction? The segment's first
and last offsets might change, but a batch keeps its offsets when
compaction occurs.
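
As a rough illustration of that point, here is a toy compaction pass in Rust.
It is only a sketch: Record and compact are hypothetical names, records are
compacted one by one rather than per batch, and the log is assumed to be
ordered by offset. It shows that survivors keep the offsets they were
written with, even though the first offset left in the segment can move.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified record: an absolute offset plus a key.
#[derive(Debug, Clone, PartialEq)]
struct Record {
    offset: i64,
    key: String,
}

/// Keep only the newest record for each key (the log is assumed to be
/// ordered by offset). Survivors are copied unchanged, offsets included.
fn compact(log: &[Record]) -> Vec<Record> {
    // Offset of the latest record seen for each key.
    let mut latest: HashMap<&str, i64> = HashMap::new();
    for record in log {
        latest.insert(record.key.as_str(), record.offset);
    }
    log.iter()
        .filter(|record| latest.get(record.key.as_str()) == Some(&record.offset))
        .cloned()
        .collect()
}
```

For a log with keys a, b, a at offsets 0, 1, 2, the compacted log holds the
records at offsets 1 and 2: the first offset moved from 0 to 1, but no
surviving record was renumbered.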

Hope that helps.
Justine

On Mon, Aug 7, 2023 at 8:59 AM Matthias J. Sax  wrote:

> > but the base offset may change during log normalizing.
>
> Not sure what you mean by "normalization" but offsets are immutable, so
> they don't change. (To be fair, I am not an expert on brokers, so not
> sure how this work in detail when log compaction ticks in).
>
> > This field is given by the producer and the broker should only read it.
>
> Sounds right. The point being is, that the broker has an "expected"
> value for it, and if the provided value does not match the expected one,
> the write is rejected to begin with.
>
>
> -Matthias
>
> On 8/7/23 6:35 AM, tison wrote:
> > Hi Matthias and Justine,
> >
> > Thanks for your reply!
> >
> > I can summarize the answer as -
> >
> > Record offset = base offset + offset delta. This field is calculated by
> the
> > broker and the delta won't change but the base offset may change during
> log
> > normalizing.
> > Record sequence = base sequence + (offset) delta. This field is given by
> > the producer and the broker should only read it.
> >
> > Is it correct?
> >
> > I implement the manipulation part of base offset following this
> > understanding at [1].
> >
> > Best,
> > tison.
> >
> > [1]
> >
> https://github.com/tisonkun/kafka-api/blob/d080ab7e4b57c0ab0182e0b254333f400e616cd2/simplesrv/src/lib.rs#L391-L394
> >
> >
> > On Wed, Aug 2, 2023 at 04:19, Justine Olshan wrote:
> >
> >> For what it's worth -- the sequence number is not calculated
> >> "baseOffset/baseSequence + offset delta" but rather by monotonically
> >> increasing for a given epoch. If the epoch is bumped, we reset back to
> >> zero.
> >> This may mean that the offset and sequence may match, but do not
> strictly
> >> need to be the same. The sequence number will also always come from the
> >> client and be in the produce records sent to the Kafka broker.
> >>
> >> As for offsets, there is some code in the log layer that maintains the
> log
> >> end offset and assigns offsets to the records. The produce handling on
> the
> >> leader should typically assign the offset.
> >> I believe you can find that code here:
> >>
> >>
> https://github.com/apache/kafka/blob/b9a45546a7918799b6fb3c0fe63b56f47d8fcba9/core/src/main/scala/kafka/log/UnifiedLog.scala#L766
> >>
> >> Justine
> >>
> >> On Tue, Aug 1, 2023 at 11:38 AM Matthias J. Sax 
> wrote:
> >>
> >>> The _offset_ is the position of the record in the partition.
> >>>
> >>> The _sequence number_ is a unique ID that allows broker to de-duplicate
> >>> messages. It requires the producer to implement the idempotency
> protocol
> >>> (part of Kafka transactions); thus, sequence numbers are optional and
> as
> >>> long as you don't want to support idempotent writes, you don't need to
> >>> worry about them. (If you want to dig into details, checkout KIP-98
> that
> >>> is the original KIP about Kafka TX).
> >>>
> >>> HTH,
> >>> -Matthias
> >>>
> >>> On 8/1/23 2:19 AM, tison wrote:
> >>>> Hi,
> >>>>
> >>>> I'm writing a Kafka API Rust codec library[1] to understand how Kafka
> >>>> models its concepts and how the core business logic works.
> >>>>
> >>>> While implementing the codec for Records[2], I saw a pair of fields,
> >>>> "sequence" and "offset". Both of them are calculated as
> >>>> baseOffset/baseSequence + offset delta, so I'm a bit confused about
> >>>> how to deal with them properly - what's the difference between these
> >>>> two concepts logically?
> >>>>
> >>>> Also, to understand how the core business logic works, I wrote a
> >>>> simple server based on my codec library, and observed that the server
> >>>> may need to update the offset of produced records. How does Kafka set
> >>>> the correct offset for each produced record? And how does Kafka
> >>>> maintain the calculation for offset and sequence during these
> >>>> modifications?
> >>>>
> >>>> I'd appreciate it if anyone can answer the question or give some
> >>>> insights :D
> >>>>
> >>>> Best,
> >>>> tison.
> >>>>
> >>>> [1] https://github.com/tisonkun/kafka-api
> >>>> [2] https://kafka.apache.org/documentation/#messageformat
> 
> >>>
> >>
> >
>


Re: [QUESTION] What is the difference between sequence and offset for a Record?

2023-08-07 Thread Matthias J. Sax

> but the base offset may change during log normalizing.

Not sure what you mean by "normalization", but offsets are immutable, so
they don't change. (To be fair, I am not an expert on brokers, so I am not
sure how this works in detail when log compaction kicks in.)



> This field is given by the producer and the broker should only read it.

Sounds right. The point is that the broker has an "expected"
value for it, and if the provided value does not match the expected one,
the write is rejected to begin with.
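
A minimal sketch of such a check, in Rust, could look like the following.
SequenceTracker, AppendError and check_and_advance are hypothetical names,
not the broker's real types, and the sketch assumes a new producer starts
at sequence 0 and ignores epochs and integer wrap-around.

```rust
use std::collections::HashMap;

/// Hypothetical error returned when a batch does not carry the sequence
/// number the broker expects next.
#[derive(Debug, PartialEq)]
enum AppendError {
    OutOfOrderSequence { expected: i32, got: i32 },
}

/// Hypothetical per-partition tracker: producer id -> next expected sequence.
#[derive(Default)]
struct SequenceTracker {
    next_expected: HashMap<i64, i32>,
}

impl SequenceTracker {
    /// Accept the batch only if its base sequence is the expected one,
    /// then advance the expectation by the number of records in the batch.
    /// The tracker only reads the producer's value; it never rewrites it.
    fn check_and_advance(
        &mut self,
        producer_id: i64,
        base_sequence: i32,
        record_count: i32,
    ) -> Result<(), AppendError> {
        // Assumption: a producer that has not been seen yet starts at 0.
        let expected = *self.next_expected.get(&producer_id).unwrap_or(&0);
        if base_sequence != expected {
            return Err(AppendError::OutOfOrderSequence {
                expected,
                got: base_sequence,
            });
        }
        self.next_expected.insert(producer_id, expected + record_count);
        Ok(())
    }
}
```

After batches of 3 and 2 records from the same producer, the tracker expects
sequence 5, so a batch arriving with base sequence 9 is rejected before
anything is written.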



-Matthias

On 8/7/23 6:35 AM, tison wrote:

Hi Matthias and Justine,

Thanks for your reply!

I can summarize the answer as -

Record offset = base offset + offset delta. This field is calculated by the
broker and the delta won't change but the base offset may change during log
normalizing.
Record sequence = base sequence + (offset) delta. This field is given by
the producer and the broker should only read it.

Is it correct?

I implement the manipulation part of base offset following this
understanding at [1].

Best,
tison.

[1]
https://github.com/tisonkun/kafka-api/blob/d080ab7e4b57c0ab0182e0b254333f400e616cd2/simplesrv/src/lib.rs#L391-L394


On Wed, Aug 2, 2023 at 04:19, Justine Olshan wrote:


For what it's worth -- the sequence number is not calculated
"baseOffset/baseSequence + offset delta" but rather by monotonically
increasing for a given epoch. If the epoch is bumped, we reset back to
zero.
This may mean that the offset and sequence may match, but do not strictly
need to be the same. The sequence number will also always come from the
client and be in the produce records sent to the Kafka broker.

As for offsets, there is some code in the log layer that maintains the log
end offset and assigns offsets to the records. The produce handling on the
leader should typically assign the offset.
I believe you can find that code here:

https://github.com/apache/kafka/blob/b9a45546a7918799b6fb3c0fe63b56f47d8fcba9/core/src/main/scala/kafka/log/UnifiedLog.scala#L766

Justine

On Tue, Aug 1, 2023 at 11:38 AM Matthias J. Sax  wrote:


The _offset_ is the position of the record in the partition.

The _sequence number_ is a unique ID that allows broker to de-duplicate
messages. It requires the producer to implement the idempotency protocol
(part of Kafka transactions); thus, sequence numbers are optional and as
long as you don't want to support idempotent writes, you don't need to
worry about them. (If you want to dig into details, checkout KIP-98 that
is the original KIP about Kafka TX).

HTH,
-Matthias

On 8/1/23 2:19 AM, tison wrote:

Hi,

I'm writing a Kafka API Rust codec library[1] to understand how Kafka
models its concepts and how the core business logic works.

While implementing the codec for Records[2], I saw a pair of fields,
"sequence" and "offset". Both of them are calculated as
baseOffset/baseSequence + offset delta, so I'm a bit confused about how to
deal with them properly - what's the difference between these two concepts
logically?

Also, to understand how the core business logic works, I wrote a simple
server based on my codec library, and observed that the server may need to
update the offset of produced records. How does Kafka set the correct offset
for each produced record? And how does Kafka maintain the calculation for
offset and sequence during these modifications?

I'd appreciate it if anyone can answer the question or give some insights :D


Best,
tison.

[1] https://github.com/tisonkun/kafka-api
[2] https://kafka.apache.org/documentation/#messageformat









Re: [QUESTION] What is the difference between sequence and offset for a Record?

2023-08-07 Thread tison
Hi Matthias and Justine,

Thanks for your reply!

I can summarize the answer as follows:

Record offset = base offset + offset delta. This field is calculated by the
broker; the delta won't change, but the base offset may change during log
normalizing.
Record sequence = base sequence + (offset) delta. This field is given by
the producer and the broker should only read it.
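
Written out as a small Rust sketch, the two formulas look like this.
BatchHeader and its methods are hypothetical names used only for
illustration. Treating a negative base sequence as "no sequence" is an
assumption of this sketch, and sequence wrap-around is not handled.

```rust
/// Hypothetical subset of a record batch header.
struct BatchHeader {
    /// Assigned on the broker side.
    base_offset: i64,
    /// Supplied by the producer. Assumption: a negative value means the
    /// producer is not idempotent and sent no sequence numbers.
    base_sequence: i32,
}

impl BatchHeader {
    /// Record offset = base offset + offset delta.
    fn record_offset(&self, offset_delta: i32) -> i64 {
        self.base_offset + i64::from(offset_delta)
    }

    /// Record sequence = base sequence + offset delta, if a sequence exists.
    /// Wrap-around near i32::MAX is deliberately left out of this sketch.
    fn record_sequence(&self, offset_delta: i32) -> Option<i32> {
        if self.base_sequence < 0 {
            None
        } else {
            Some(self.base_sequence + offset_delta)
        }
    }
}
```

The same delta feeds both calculations, but the two base values come from
different sides: the base offset from the broker, the base sequence from the
producer.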

Is it correct?

I implemented the base offset manipulation following this
understanding at [1].

Best,
tison.

[1]
https://github.com/tisonkun/kafka-api/blob/d080ab7e4b57c0ab0182e0b254333f400e616cd2/simplesrv/src/lib.rs#L391-L394


On Wed, Aug 2, 2023 at 04:19, Justine Olshan wrote:

> For what it's worth -- the sequence number is not calculated
> "baseOffset/baseSequence + offset delta" but rather by monotonically
> increasing for a given epoch. If the epoch is bumped, we reset back to
> zero.
> This may mean that the offset and sequence may match, but do not strictly
> need to be the same. The sequence number will also always come from the
> client and be in the produce records sent to the Kafka broker.
>
> As for offsets, there is some code in the log layer that maintains the log
> end offset and assigns offsets to the records. The produce handling on the
> leader should typically assign the offset.
> I believe you can find that code here:
>
> https://github.com/apache/kafka/blob/b9a45546a7918799b6fb3c0fe63b56f47d8fcba9/core/src/main/scala/kafka/log/UnifiedLog.scala#L766
>
> Justine
>
> On Tue, Aug 1, 2023 at 11:38 AM Matthias J. Sax  wrote:
>
> > The _offset_ is the position of the record in the partition.
> >
> > The _sequence number_ is a unique ID that allows broker to de-duplicate
> > messages. It requires the producer to implement the idempotency protocol
> > (part of Kafka transactions); thus, sequence numbers are optional and as
> > long as you don't want to support idempotent writes, you don't need to
> > worry about them. (If you want to dig into details, checkout KIP-98 that
> > is the original KIP about Kafka TX).
> >
> > HTH,
> >-Matthias
> >
> > On 8/1/23 2:19 AM, tison wrote:
> > > Hi,
> > >
> > > I'm wringing a Kafka API Rust codec library[1] to understand how Kafka
> > > models its concepts and how the core business logic works.
> > >
> > > During implementing the codec for Records[2], I saw a twins of fields
> > > "sequence" and "offset". Both of them are calculated by
> > > baseOffset/baseSequence + offset delta. Then I'm a bit confused how to
> > deal
> > > with them properly - what's the difference between these two concepts
> > > logically?
> > >
> > > Also, to understand how the core business logic works, I write a simple
> > > server based on my codec library, and observe that the server may need
> to
> > > update offset for records produced. How does Kafka set the correct
> offset
> > > for each produced records? And how does Kafka maintain the calculation
> > for
> > > offset and sequence during these modifications?
> > >
> > > I'll appreciate if anyone can answer the question or give some insights
> > :D
> > >
> > > Best,
> > > tison.
> > >
> > > [1] https://github.com/tisonkun/kafka-api
> > > [2] https://kafka.apache.org/documentation/#messageformat
> > >
> >
>


Re: [QUESTION] What is the difference between sequence and offset for a Record?

2023-08-01 Thread Justine Olshan
For what it's worth -- the sequence number is not calculated as
"baseOffset/baseSequence + offset delta"; rather, it increases monotonically
within a given epoch. If the epoch is bumped, we reset back to zero.
This means that the offset and sequence may match, but they do not strictly
need to be the same. The sequence number also always comes from the
client and is included in the produce records sent to the Kafka broker.
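
A minimal client-side sketch in Rust of that numbering might look like the
following. ProducerState and its methods are hypothetical names, the state
is for a single partition, and overflow is ignored; this is an illustration
of the idea, not the real producer code.

```rust
/// Hypothetical producer-side state for one partition.
struct ProducerState {
    epoch: i16,
    next_sequence: i32,
}

impl ProducerState {
    fn new() -> Self {
        Self { epoch: 0, next_sequence: 0 }
    }

    /// Reserve sequence numbers for a batch of `record_count` records and
    /// return the base sequence to put in the batch header.
    fn next_base_sequence(&mut self, record_count: i32) -> i32 {
        let base = self.next_sequence;
        self.next_sequence += record_count;
        base
    }

    /// Bumping the epoch resets the sequence back to zero.
    fn bump_epoch(&mut self) {
        self.epoch += 1;
        self.next_sequence = 0;
    }
}
```

Batches of 3 and 2 records get base sequences 0 and 3. After bump_epoch(),
the next batch starts at 0 again, whatever offsets the broker has assigned
in the meantime.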

As for offsets, there is some code in the log layer that maintains the log
end offset and assigns offsets to the records. The produce handling on the
leader should typically assign the offset.
I believe you can find that code here:
https://github.com/apache/kafka/blob/b9a45546a7918799b6fb3c0fe63b56f47d8fcba9/core/src/main/scala/kafka/log/UnifiedLog.scala#L766
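
As a simplified illustration of that assignment (not the linked Scala code),
here is a Rust sketch. PartitionLog and append are hypothetical names, and
the sketch only tracks the log end offset, not the stored records.

```rust
/// Hypothetical leader-side view of one partition's log.
struct PartitionLog {
    /// Offset that the next appended record will receive.
    log_end_offset: i64,
}

impl PartitionLog {
    /// Assign offsets to an incoming batch of `record_count` records.
    /// Returns (first_offset, last_offset) and advances the log end offset,
    /// or None for an empty batch.
    fn append(&mut self, record_count: i64) -> Option<(i64, i64)> {
        if record_count <= 0 {
            return None;
        }
        let first = self.log_end_offset;
        let last = first + record_count - 1;
        self.log_end_offset = last + 1;
        Some((first, last))
    }
}
```

Starting from a log end offset of 100, a 3-record batch gets offsets 100 to
102 and the next batch starts at 103, regardless of any sequence numbers
the producer attached.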

Justine

On Tue, Aug 1, 2023 at 11:38 AM Matthias J. Sax  wrote:

> The _offset_ is the position of the record in the partition.
>
> The _sequence number_ is a unique ID that allows broker to de-duplicate
> messages. It requires the producer to implement the idempotency protocol
> (part of Kafka transactions); thus, sequence numbers are optional and as
> long as you don't want to support idempotent writes, you don't need to
> worry about them. (If you want to dig into details, checkout KIP-98 that
> is the original KIP about Kafka TX).
>
> HTH,
>-Matthias
>
> On 8/1/23 2:19 AM, tison wrote:
> > Hi,
> >
> > I'm wringing a Kafka API Rust codec library[1] to understand how Kafka
> > models its concepts and how the core business logic works.
> >
> > During implementing the codec for Records[2], I saw a twins of fields
> > "sequence" and "offset". Both of them are calculated by
> > baseOffset/baseSequence + offset delta. Then I'm a bit confused how to
> deal
> > with them properly - what's the difference between these two concepts
> > logically?
> >
> > Also, to understand how the core business logic works, I write a simple
> > server based on my codec library, and observe that the server may need to
> > update offset for records produced. How does Kafka set the correct offset
> > for each produced records? And how does Kafka maintain the calculation
> for
> > offset and sequence during these modifications?
> >
> > I'll appreciate if anyone can answer the question or give some insights
> :D
> >
> > Best,
> > tison.
> >
> > [1] https://github.com/tisonkun/kafka-api
> > [2] https://kafka.apache.org/documentation/#messageformat
> >
>


Re: [QUESTION] What is the difference between sequence and offset for a Record?

2023-08-01 Thread Matthias J. Sax

The _offset_ is the position of the record in the partition.

The _sequence number_ is a unique ID that allows the broker to de-duplicate
messages. It requires the producer to implement the idempotency protocol
(part of Kafka transactions); thus, sequence numbers are optional, and as
long as you don't want to support idempotent writes, you don't need to
worry about them. (If you want to dig into the details, check out KIP-98,
which is the original KIP about Kafka TX.)
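
To illustrate the two roles side by side, here is a toy Rust sketch. The
Partition type is hypothetical, and remembering every (producer id,
sequence) pair in a set is a simplification made for the example, not how a
real broker tracks producers.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory partition: a record's offset is its position.
#[derive(Default)]
struct Partition {
    records: Vec<String>,
    /// (producer id, sequence) pairs that were already appended.
    seen: HashSet<(i64, i32)>,
}

impl Partition {
    /// Returns Some(offset) if the record was appended, or None if the
    /// (producer id, sequence) pair marks it as a retried duplicate.
    fn append(&mut self, producer_id: i64, sequence: i32, value: &str) -> Option<i64> {
        if !self.seen.insert((producer_id, sequence)) {
            return None;
        }
        self.records.push(value.to_string());
        Some(self.records.len() as i64 - 1)
    }
}
```

A retry that reuses the same producer id and sequence is dropped, while a
different producer using the same sequence still gets its own offset.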


HTH,
  -Matthias

On 8/1/23 2:19 AM, tison wrote:

Hi,

I'm writing a Kafka API Rust codec library[1] to understand how Kafka
models its concepts and how the core business logic works.

While implementing the codec for Records[2], I saw a pair of fields,
"sequence" and "offset". Both of them are calculated as
baseOffset/baseSequence + offset delta, so I'm a bit confused about how to
deal with them properly - what's the difference between these two concepts
logically?

Also, to understand how the core business logic works, I wrote a simple
server based on my codec library, and observed that the server may need to
update the offset of produced records. How does Kafka set the correct offset
for each produced record? And how does Kafka maintain the calculation for
offset and sequence during these modifications?

I'd appreciate it if anyone can answer the question or give some insights :D

Best,
tison.

[1] https://github.com/tisonkun/kafka-api
[2] https://kafka.apache.org/documentation/#messageformat



[QUESTION] What is the difference between sequence and offset for a Record?

2023-08-01 Thread tison
Hi,

I'm writing a Kafka API Rust codec library[1] to understand how Kafka
models its concepts and how the core business logic works.

While implementing the codec for Records[2], I saw a pair of fields,
"sequence" and "offset". Both of them are calculated as
baseOffset/baseSequence + offset delta, so I'm a bit confused about how to
deal with them properly - what's the difference between these two concepts
logically?

Also, to understand how the core business logic works, I wrote a simple
server based on my codec library, and observed that the server may need to
update the offset of produced records. How does Kafka set the correct offset
for each produced record? And how does Kafka maintain the calculation for
offset and sequence during these modifications?

I'd appreciate it if anyone can answer the question or give some insights :D

Best,
tison.

[1] https://github.com/tisonkun/kafka-api
[2] https://kafka.apache.org/documentation/#messageformat