The upgrade plan works, but the potentially long interim phase of
skipping zero-copy for down-conversion could be problematic, especially
for large deployments with large consumer fan-out. It is not only
going to be memory overhead but CPU as well, since you need to
decompress, rewrite absolute offsets, then recompress for every v1
fetch. So it may be safer (but obviously more tedious) to have a
multi-step upgrade process. For example:

1 - Upgrade brokers, but disable the feature. i.e., either reject v2
producer requests or down-convert to the old message format (with
absolute offsets)
2 - Upgrade clients, but have them only use v1 requests
3 - Switch (all or most) consumers to use the v2 fetch format (which
will use zero-copy).
4 - Turn on the feature on the brokers to allow v2 producer requests
5 - Switch producers to use the v2 produce format

(You may want a v1 fetch rate metric and decide to proceed to step 4
only when that comes down to a trickle)

I'm not sure if the prolonged upgrade process is viable in every
scenario. I think it should work at LinkedIn, for example, but may not
for other environments.

Joel


On Tue, Sep 22, 2015 at 12:55 AM, Jiangjie Qin
<j...@linkedin.com.invalid> wrote:
> Thanks for the explanation, Jay.
> Agreed. We have to keep the offset as the offset of the last inner message.
>
> Jiangjie (Becket) Qin
>
> On Mon, Sep 21, 2015 at 6:21 PM, Jay Kreps <j...@confluent.io> wrote:
>
>> For (3) I don't think we can change the offset in the outer message from
>> what it is today as it is relied upon in the search done in the log layer.
>> The reason it is the offset of the last message rather than the first is to
>> make the offset a least upper bound (i.e. the smallest offset >=
>> fetch_offset). This needs to work the same for both gaps due to compacted
>> topics and gaps due to compressed messages.
>>
>> So imagine you had a compressed set with offsets {45, 46, 47, 48}: if
>> you assigned this compressed set the offset 45, a fetch for 46 would
>> actually skip ahead to 49 (the least upper bound).
>>
>> -Jay
>>
>> On Mon, Sep 21, 2015 at 5:17 PM, Jun Rao <j...@confluent.io> wrote:
>>
>> > Jiangjie,
>> >
>> > Thanks for the writeup. A few comments below.
>> >
>> > 1. We will need to be a bit careful with fetch requests from the
>> followers.
>> > Basically, as we are doing a rolling upgrade of the brokers, the follower
>> > can't start issuing V2 of the fetch request until the rest of the brokers
>> > are ready to process it. So, we probably need to make use of
>> > inter.broker.protocol.version to do the rolling upgrade. In step 1, we
>> set
>> > inter.broker.protocol.version to 0.9 and do a round of rolling upgrade of
>> > the brokers. At this point, all brokers are capable of processing V2 of
>> > fetch requests, but no broker is using it yet. In step 2, we
>> > set inter.broker.protocol.version to 0.10 and do another round of rolling
>> > restart of the brokers. In this step, the upgraded brokers will start
>> > issuing V2 of the fetch request.
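Jun's two-step rolling upgrade could be expressed in broker config roughly like this (illustrative fragment, not a complete server.properties; the config name and version values are the ones discussed above):

```properties
# Step 1: upgrade broker binaries, one rolling restart, but keep
# speaking the old protocol. After this round every broker can
# *accept* V2 fetch requests, but no broker *sends* them yet.
inter.broker.protocol.version=0.9

# Step 2: flip the version and do a second rolling restart.
# Restarted brokers start issuing V2 fetch requests as followers.
inter.broker.protocol.version=0.10
```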
>> >
>> > 2. If we do #1, I am not sure if there is still a need for
>> > message.format.version since the broker can start writing messages in the
>> > new format after inter.broker.protocol.version is set to 0.10.
>> >
>> > 3. It wasn't clear from the wiki whether the base offset in the shallow
>> > message is the offset of the first or the last inner message. It's better
>> > to use the offset of the last inner message. This way, the followers
>> don't
>> > have to decompress messages to figure out the next fetch offset.
>> >
>> > 4. I am not sure that I understand the following sentence in the wiki. It
>> > seems that the relative offsets in a compressed message don't have to be
>> > consecutive. If so, why do we need to update the relative offsets in the
>> > inner messages?
>> > "When the log cleaner compacts log segments, it needs to update the inner
>> > message's relative offset values."
>> >
>> > Thanks,
>> >
>> > Jun
>> >
>> > On Thu, Sep 17, 2015 at 12:54 PM, Jiangjie Qin <j...@linkedin.com.invalid
>> >
>> > wrote:
>> >
>> > > Hi folks,
>> > >
>> > > Thanks a lot for the feedback on KIP-31 - move to use relative offset.
>> > (Not
>> > > including timestamp and index discussion).
>> > >
>> > > I updated the migration plan section as we discussed on KIP hangout. I
>> > > think it is the only concern raised so far. Please let me know if there
>> > are
>> > > further comments about the KIP.
>> > >
>> > > Thanks,
>> > >
>> > > Jiangjie (Becket) Qin
>> > >
>> > > On Mon, Sep 14, 2015 at 5:13 PM, Jiangjie Qin <j...@linkedin.com>
>> wrote:
>> > >
>> > > > I just updated KIP-33 to explain the indexing on CreateTime and
>> > > > LogAppendTime respectively. I also used some use cases to compare
>> > > > the two solutions.
>> > > > Although this is for KIP-33, it does give some insight on whether
>> > > > it makes sense to have a per-message LogAppendTime.
>> > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
>> > > >
>> > > > As a short summary of the conclusions we have already reached on
>> > > timestamp:
>> > > > 1. It is good to add a timestamp to the message.
>> > > > 2. LogAppendTime should be used for broker policy enforcement (Log
>> > > > retention / rolling)
>> > > > 3. It is useful to have a CreateTime in message format, which is
>> > > immutable
>> > > > after producer sends the message.
>> > > >
>> > > > There are following questions still in discussion:
>> > > > 1. Should we also add LogAppendTime to message format?
>> > > > 2. which timestamp should we use to build the index.
>> > > >
>> > > > Let's talk about question 1 first because question 2 is actually a
>> > follow
>> > > > up question for question 1.
>> > > > Here are what I think:
>> > > > 1a. To enforce broker log policy, theoretically we don't need a
>> > > > per-message LogAppendTime. But if we don't include LogAppendTime in
>> > > > the message, we still need to implement a separate solution to pass
>> > > > log segment timestamps among brokers, which further complicates
>> > > > replication.
>> > > > 1b. LogAppendTime has some advantage over CreateTime (KIP-33 has
>> detail
>> > > > comparison)
>> > > > 1c. We have already exposed offset, which is essentially an internal
>> > > > concept of the message in terms of position. Exposing LogAppendTime
>> > > > means we expose another internal concept of the message in terms of
>> > > > time.
>> > > >
>> > > > Considering the above reasons, personally I think it is worth adding
>> > > > the LogAppendTime to each message.
>> > > >
>> > > > Any thoughts?
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Jiangjie (Becket) Qin
>> > > >
>> > > > On Mon, Sep 14, 2015 at 11:44 AM, Jiangjie Qin <j...@linkedin.com>
>> > > wrote:
>> > > >
>> > > >> I was trying to send the last email before the KIP hangout, so I
>> > > >> may not have thought it through completely. By the way, the
>> > > >> discussion is actually more related to KIP-33, i.e. whether we
>> > > >> should index on CreateTime or LogAppendTime. (Although it seems all
>> > > >> the discussions are still in this mailing thread...)
>> > > >> The solution in the last email is for indexing on CreateTime. It is
>> > > >> essentially what Jay suggested, except we use a timestamp map
>> > > >> instead of a memory-mapped index file. Please ignore the proposal of
>> > > >> using a log compacted topic. The solution can be simplified to:
>> > > >>
>> > > >> Each broker keeps
>> > > >> 1. a timestamp index map - Map[TopicPartitionSegment, Map[Timestamp,
>> > > >> Offset]]. The timestamp is on a minute boundary.
>> > > >> 2. A timestamp index file for each segment.
>> > > >> When a broker receives a message (as either leader or follower), it
>> > > >> checks if the timestamp index map contains the timestamp for the
>> > > >> current segment. If the timestamp does not exist, the broker adds
>> > > >> the offset to the map and appends an entry to the timestamp index
>> > > >> file. i.e. we only use the index file as a persistent copy of the
>> > > >> timestamp index map.
>> > > >>
>> > > >> When a log segment is deleted, we need to:
>> > > >> 1. delete the TopicPartitionSegment key in the timestamp index map.
>> > > >> 2. delete the timestamp index file
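The per-segment, minute-granularity index described above can be sketched like this (hypothetical Python, ignoring the persistent index file; class and method names are illustrative, not Kafka code):

```python
from collections import defaultdict

class TimestampIndex:
    """In-memory timestamp index: for each (topic, partition, segment)
    key, map a minute-rounded timestamp to the smallest offset seen in
    that minute. Search granularity is therefore one minute."""

    def __init__(self):
        # (topic, partition, segment) -> {minute_ms: first_offset}
        self.index = defaultdict(dict)

    def on_append(self, tps, timestamp_ms, offset):
        minute = timestamp_ms - timestamp_ms % 60000  # round down to minute
        # Only record the first offset seen for this minute, mirroring
        # "append an entry if the timestamp does not exist".
        self.index[tps].setdefault(minute, offset)

    def lookup(self, tps, timestamp_ms):
        """Smallest indexed offset at or after the minute containing
        timestamp_ms, or None if nothing that late is indexed."""
        minute = timestamp_ms - timestamp_ms % 60000
        candidates = [o for m, o in self.index[tps].items() if m >= minute]
        return min(candidates) if candidates else None

    def on_segment_delete(self, tps):
        # The on-disk timestamp index file would be deleted as well.
        self.index.pop(tps, None)
```

This also makes the listed trade-offs visible: lookups resolve to minute boundaries, and the whole map lives in memory for as long as its segments do.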
>> > > >>
>> > > >> This solution assumes we only keep CreateTime in the message. There
>> > are
>> > > a
>> > > >> few trade-offs in this solution:
>> > > >> 1. The granularity of search will be per minute.
>> > > >> 2. The entire timestamp index map has to be in memory all the time.
>> > > >> 3. We need to think about another way to honor log retention time
>> and
>> > > >> time-based log rolling.
>> > > >> 4. We lose the benefit brought by including LogAppendTime in the
>> > message
>> > > >> mentioned earlier.
>> > > >>
>> > > >> I am not sure whether this solution is necessarily better than
>> > indexing
>> > > >> on LogAppendTime.
>> > > >>
>> > > >> I will update KIP-33 to explain the solution to index on CreateTime
>> > and
>> > > >> LogAppendTime respectively and put some more concrete use cases as
>> > well.
>> > > >>
>> > > >> Thanks,
>> > > >>
>> > > >> Jiangjie (Becket) Qin
>> > > >>
>> > > >>
>> > > >> On Mon, Sep 14, 2015 at 9:40 AM, Jiangjie Qin <j...@linkedin.com>
>> > > wrote:
>> > > >>
>> > > >>> Hi Joel,
>> > > >>>
>> > > >>> Good point about rebuilding the index. I agree that having a
>> > > >>> per-message LogAppendTime might be necessary. About time
>> > > >>> adjustment, the solution sounds promising, but it might be better
>> > > >>> to make it a follow-up to the KIP because it seems a really rare
>> > > >>> use case.
>> > > >>>
>> > > >>> I have another thought on how to manage the out-of-order
>> > > >>> timestamps. Maybe we can do the following:
>> > > >>> Create a special log compacted topic __timestamp_index; the key
>> > > >>> would be (TopicPartition, TimeStamp_Rounded_To_Minute), and the
>> > > >>> value is the offset. In memory, we keep a map for each
>> > > >>> TopicPartition, where the value is
>> > > >>> (timestamp_rounded_to_minute -> smallest_offset_in_the_minute).
>> > > >>> This way we can search out-of-order messages and make sure no
>> > > >>> message is missing.
>> > > >>>
>> > > >>> Thoughts?
>> > > >>>
>> > > >>> Thanks,
>> > > >>>
>> > > >>> Jiangjie (Becket) Qin
>> > > >>>
>> > > >>> On Fri, Sep 11, 2015 at 12:46 PM, Joel Koshy <jjkosh...@gmail.com>
>> > > >>> wrote:
>> > > >>>
>> > > >>>> Jay had mentioned the scenario of mirror-maker bootstrap which
>> would
>> > > >>>> effectively reset the logAppendTimestamps for the bootstrapped
>> data.
>> > > >>>> If we don't include logAppendTimestamps in each message there is a
>> > > >>>> similar scenario when rebuilding indexes during recovery. So it
>> > seems
>> > > >>>> it may be worth adding that timestamp to messages. The drawback to
>> > > >>>> that is exposing a server-side concept in the protocol (although
>> we
>> > > >>>> already do that with offsets). logAppendTimestamp really should be
>> > > >>>> decided by the broker so I think the first scenario may have to be
>> > > >>>> written off as a gotcha, but the second may be worth addressing
>> (by
>> > > >>>> adding it to the message format).
>> > > >>>>
>> > > >>>> The other point that Jay raised which needs to be addressed (since
>> > we
>> > > >>>> require monotonically increasing timestamps in the index) in the
>> > > >>>> proposal is changing time on the server (I'm a little less
>> concerned
>> > > >>>> about NTP clock skews than a user explicitly changing the server's
>> > > >>>> time - i.e., big clock skews). We would at least want to "set
>> back"
>> > > >>>> all the existing timestamps to guarantee non-decreasing timestamps
>> > > >>>> with future messages. I'm not sure at this point how best to
>> handle
>> > > >>>> that, but we could perhaps have an epoch/base-time (or
>> > time-correction)
>> > > >>>> stored in the log directories and base all log index timestamps
>> off
>> > > >>>> that base-time (or corrected). So if at any time you determine
>> that
>> > > >>>> time has changed backwards you can adjust that base-time without
>> > > >>>> having to fix up all the entries. Without knowing the exact diff
>> > > >>>> between the previous clock and new clock we cannot adjust the
>> times
>> > > >>>> exactly, but we can at least ensure increasing timestamps.
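The base-time correction described here could look roughly like this (a hypothetical sketch, not a concrete proposal; in practice the correction would be persisted in the log directory rather than held in a field):

```python
class CorrectedClock:
    """Wall clock plus a stored correction so that indexed timestamps
    never decrease, even if the system clock is set backwards.

    Without knowing the exact diff between the old and new clocks we
    cannot adjust times exactly, but we can guarantee non-decreasing
    timestamps by bumping one correction value instead of rewriting
    every index entry."""

    def __init__(self, now_fn):
        self.now_fn = now_fn   # raw system clock, in milliseconds
        self.correction = 0    # would be persisted per log directory
        self.last_ts = 0       # last timestamp handed out

    def timestamp(self):
        raw = self.now_fn() + self.correction
        if raw < self.last_ts:
            # Clock moved backwards: absorb the jump into the
            # correction so timestamps stay non-decreasing.
            self.correction += self.last_ts - raw
            raw = self.last_ts
        self.last_ts = raw
        return raw
```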
>> > > >>>>
>> > > >>>> On Fri, Sep 11, 2015 at 10:52 AM, Jiangjie Qin
>> > > >>>> <j...@linkedin.com.invalid> wrote:
>> > > >>>> > Ewen and Jay,
>> > > >>>> >
>> > > >>>> > The way I see it, LogAppendTime is another form of "offset". It
>> > > >>>> > serves the following purposes:
>> > > >>>> > 1. Locate messages not only by position, but also by time. The
>> > > >>>> > difference from offset is that a timestamp is not unique across
>> > > >>>> > messages.
>> > > >>>> > 2. Allow broker to manage messages based on time, e.g.
>> retention,
>> > > >>>> rolling
>> > > >>>> > 3. Provide convenience for user to search message not only by
>> > > offset,
>> > > >>>> but
>> > > >>>> > also by timestamp.
>> > > >>>> >
>> > > >>>> > For purpose (2) we don't need per message server timestamp. We
>> > only
>> > > >>>> need
>> > > >>>> > per log segment server timestamp and propagate it among brokers.
>> > > >>>> >
>> > > >>>> > For (1) and (3), we need per message timestamp. Then the
>> question
>> > is
>> > > >>>> > whether we should use CreateTime or LogAppendTime?
>> > > >>>> >
>> > > >>>> > I completely agree that an application timestamp is very useful
>> > for
>> > > >>>> many
>> > > >>>> > use cases. But it seems to me that having Kafka understand and
>> > > >>>> > maintain application timestamps is a bit too demanding. So I
>> > > >>>> > think there is value in passing on CreateTime for application
>> > > >>>> > convenience, but I am not sure it can replace LogAppendTime.
>> > > >>>> > Managing out-of-order CreateTime is equivalent to allowing
>> > > >>>> > producers to send their own offsets and asking the broker to
>> > > >>>> > manage those offsets for them. It is going to be very hard to
>> > > >>>> > maintain and could create huge performance/functional issues
>> > > >>>> > because of the complicated logic.
>> > > >>>> >
>> > > >>>> > About whether we should expose LogAppendTime to broker, I agree
>> > that
>> > > >>>> server
>> > > >>>> > timestamp is internal to broker, but isn't offset also an
>> internal
>> > > >>>> concept?
>> > > >>>> > Arguably it's not provided by producer so consumer application
>> > logic
>> > > >>>> does
>> > > >>>> > not have to know offset. But user needs to know offset because
>> > they
>> > > >>>> need to
>> > > >>>> > know "where is the message" in the log. LogAppendTime provides
>> the
>> > > >>>> answer
>> > > >>>> > of "When was the message appended" to the log. So personally I
>> > think
>> > > >>>> it is
>> > > >>>> > reasonable to expose the LogAppendTime to consumers.
>> > > >>>> >
>> > > >>>> > I can see some use cases of exposing the LogAppendTime, to name
>> > > some:
>> > > >>>> > 1. Let's say the broker has 7 days of log retention and some
>> > > >>>> > application wants to reprocess the data from the past 3 days.
>> > > >>>> > The user can simply provide the timestamp and start consuming.
>> > > >>>> > 2. Users can easily know the lag by time.
>> > > >>>> > 3. Cross-cluster failover. This is a more complicated use case;
>> > > >>>> > there are two goals: 1) not lose messages; and 2) not reconsume
>> > > >>>> > tons of messages. Only knowing an offset in cluster A won't help
>> > > >>>> > with finding the failover point in cluster B, because an offset
>> > > >>>> > in one cluster means nothing to another cluster. A timestamp,
>> > > >>>> > however, is a good cross-cluster reference in this case.
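Use case 3 could look roughly like this (hypothetical sketch; `failover_resume_offset` and the in-memory index shape are invented for illustration, reusing the minute-rounded timestamp index idea from this thread, and are not a real Kafka API):

```python
def failover_resume_offset(target_index, last_consumed_ts_ms, rewind_ms=60000):
    """Pick a resume offset in the failover cluster from the timestamp
    of the last message consumed in the primary cluster.

    target_index maps minute-rounded timestamps (ms) to the smallest
    offset in that minute for the target cluster's partition. Rewinding
    by rewind_ms (to cover clock skew between clusters) trades some
    reconsumption for not losing messages."""
    target_ts = last_consumed_ts_ms - rewind_ms
    minute = target_ts - target_ts % 60000
    # Smallest offset at or after the rewound minute.
    candidates = [off for ts, off in target_index.items() if ts >= minute]
    return min(candidates) if candidates else 0
```

The same lookup with a 3-day rewind covers use case 1 (reprocess the last 3 days of a 7-day-retention topic).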
>> > > >>>> >
>> > > >>>> > Thanks,
>> > > >>>> >
>> > > >>>> > Jiangjie (Becket) Qin
>> > > >>>> >
>> > > >>>> > On Thu, Sep 10, 2015 at 9:28 PM, Ewen Cheslack-Postava <
>> > > >>>> e...@confluent.io>
>> > > >>>> > wrote:
>> > > >>>> >
>> > > >>>> >> Re: MM preserving timestamps: Yes, this was how I interpreted
>> the
>> > > >>>> point in
>> > > >>>> >> the KIP and I only raised the issue because it restricts the
>> > > >>>> usefulness of
>> > > >>>> >> timestamps anytime MM is involved. I agree it's not a deal
>> > breaker,
>> > > >>>> but I
>> > > >>>> >> wanted to understand exact impact of the change. Some users
>> seem
>> > to
>> > > >>>> want to
>> > > >>>> >> be able to seek by application-defined timestamps (despite the
>> > many
>> > > >>>> obvious
>> > > >>>> >> issues involved), and the proposal clearly would not support
>> that
>> > > >>>> unless
>> > > >>>> >> the timestamps submitted with the produce requests were
>> > respected.
>> > > >>>> If we
>> > > >>>> >> ignore client submitted timestamps, then we probably want to
>> try
>> > to
>> > > >>>> hide
>> > > >>>> >> the timestamps as much as possible in any public interface
>> (e.g.
>> > > >>>> never
>> > > >>>> >> shows up in any public consumer APIs), but expose it just
>> enough
>> > to
>> > > >>>> be
>> > > >>>> >> useful for operational purposes.
>> > > >>>> >>
>> > > >>>> >> Sorry if my devil's advocate position / attempt to map the
>> design
>> > > >>>> space led
>> > > >>>> >> to some confusion!
>> > > >>>> >>
>> > > >>>> >> -Ewen
>> > > >>>> >>
>> > > >>>> >>
>> > > >>>> >> On Thu, Sep 10, 2015 at 5:48 PM, Jay Kreps <j...@confluent.io>
>> > > wrote:
>> > > >>>> >>
>> > > >>>> >> > Ah, I see - I think I misunderstood about MM. It was called
>> > > >>>> >> > out in the proposal, and I thought you were saying you'd
>> > > >>>> >> > retain the timestamp, but I think you're calling out that
>> > > >>>> >> > you're not. In that case you do have the opposite problem,
>> > > >>>> >> > right? When you add mirroring for a topic, all that data
>> > > >>>> >> > will have a timestamp of now and retention won't be right.
>> > > >>>> >> > Not a blocker, but a bit of a gotcha.
>> > > >>>> >> >
>> > > >>>> >> > -Jay
>> > > >>>> >> >
>> > > >>>> >> >
>> > > >>>> >> >
>> > > >>>> >> > On Thu, Sep 10, 2015 at 5:40 PM, Joel Koshy <
>> > jjkosh...@gmail.com
>> > > >
>> > > >>>> wrote:
>> > > >>>> >> >
>> > > >>>> >> > > > Don't you see all the same issues you see with
>> > client-defined
>> > > >>>> >> > timestamps
>> > > >>>> >> > > > if you let mm control the timestamp as you were
>> proposing?
>> > > >>>> That means
>> > > >>>> >> > > time
>> > > >>>> >> > >
>> > > >>>> >> > > Actually I don't think that was in the proposal (or was
>> it?).
>> > > >>>> i.e., I
>> > > >>>> >> > > think it was always supposed to be controlled by the broker
>> > > (and
>> > > >>>> not
>> > > >>>> >> > > MM).
>> > > >>>> >> > >
>> > > >>>> >> > > > Also, Joel, can you just confirm that you guys have
>> talked
>> > > >>>> through
>> > > >>>> >> the
>> > > >>>> >> > > > whole timestamp thing with the Samza folks at LI? The
>> > reason
>> > > I
>> > > >>>> ask
>> > > >>>> >> > about
>> > > >>>> >> > > > this is that Samza and Kafka Streams (KIP-28) are both
>> > trying
>> > > >>>> to rely
>> > > >>>> >> > on
>> > > >>>> >> > >
>> > > >>>> >> > > We have not. This is a good point - we will follow-up.
>> > > >>>> >> > >
>> > > >>>> >> > > > WRT your idea of a FollowerFetchRequest, I had thought of a
>> > > >>>> similar
>> > > >>>> >> idea
>> > > >>>> >> > > > where we use the leader's timestamps to approximately set
>> > the
>> > > >>>> >> > follower's
>> > > >>>> >> > > > timestamps. I had thought of just adding a partition
>> > metadata
>> > > >>>> request
>> > > >>>> >> > > that
>> > > >>>> >> > > > would subsume the current offset/time lookup and could be
>> > > used
>> > > >>>> by the
>> > > >>>> >> > > > follower to try to approximately keep their timestamps
>> > > kosher.
>> > > >>>> It's a
>> > > >>>> >> > > > little hacky and doesn't help with MM but it is also
>> maybe
>> > > less
>> > > >>>> >> > invasive
>> > > >>>> >> > > so
>> > > >>>> >> > > > that approach could be viable.
>> > > >>>> >> > >
>> > > >>>> >> > > That would also work, but perhaps responding with the
>> actual
>> > > >>>> leader
>> > > >>>> >> > > offset-timestamp entries (corresponding to the fetched
>> > portion)
>> > > >>>> would
>> > > >>>> >> > > be exact and it should be small as well. Anyway, the main
>> > > >>>> motivation
>> > > >>>> >> > > in this was to avoid leaking server-side timestamps to the
>> > > >>>> >> > > message-format if people think it is worth it so the
>> > > >>>> alternatives are
>> > > >>>> >> > > implementation details. My original instinct was that it
>> also
>> > > >>>> avoids a
>> > > >>>> >> > > backwards incompatible change (but it does not because we
>> > also
>> > > >>>> have
>> > > >>>> >> > > the relative offset change).
>> > > >>>> >> > >
>> > > >>>> >> > > Thanks,
>> > > >>>> >> > >
>> > > >>>> >> > > Joel
>> > > >>>> >> > >
>> > > >>>> >> > > >
>> > > >>>> >> > > >
>> > > >>>> >> > > >
>> > > >>>> >> > > > On Thu, Sep 10, 2015 at 3:36 PM, Joel Koshy <
>> > > >>>> jjkosh...@gmail.com>
>> > > >>>> >> > wrote:
>> > > >>>> >> > > >
>> > > >>>> >> > > >> I just wanted to comment on a few points made earlier in
>> > > this
>> > > >>>> >> thread:
>> > > >>>> >> > > >>
>> > > >>>> >> > > >> Concerns on clock skew: at least for the original
>> > > >>>> >> > > >> proposal's scope (which was more for honoring retention
>> > > >>>> >> > > >> broker-side), this would only be an issue when spanning
>> > > >>>> >> > > >> leader movements, right? i.e., leader migration latency
>> > > >>>> >> > > >> has to be much less than clock skew for this to be a
>> > > >>>> >> > > >> real issue, wouldn't it?
>> > > >>>> >> > > >>
>> > > >>>> >> > > >> Client timestamp vs broker timestamp: I’m not sure Kafka
>> > > >>>> (brokers)
>> > > >>>> >> are
>> > > >>>> >> > > >> the right place to reason about client-side timestamps
>> > > >>>> precisely due
>> > > >>>> >> > > >> to the nuances that have been discussed at length in
>> this
>> > > >>>> thread. My
>> > > >>>> >> > > >> preference would have been for the timestamp (now called
>> > > >>>> >> > > >> LogAppendTimestamp) to have nothing to do with the
>> > > >>>> >> > > >> applications. Ewen raised a valid concern about leaking
>> > > >>>> >> > > >> such "private/server-side" timestamps into the protocol
>> > > >>>> >> > > >> spec. i.e., it is fine to have the CreateTime, which is
>> > > >>>> >> > > >> expressly client-provided and immutable thereafter, but
>> > > >>>> >> > > >> the LogAppendTime is also going to be part of the protocol
>> > > >>>> >> > > >> and it would be good to avoid exposure (to client
>> > > developers)
>> > > >>>> if
>> > > >>>> >> > > >> possible. Ok, so here is a slightly different approach
>> > that
>> > > I
>> > > >>>> was
>> > > >>>> >> just
>> > > >>>> >> > > >> thinking about (and did not think too far so it may not
>> > > >>>> work): do
>> > > >>>> >> not
>> > > >>>> >> > > >> add the LogAppendTime to messages. Instead, build the
>> > > >>>> time-based
>> > > >>>> >> index
>> > > >>>> >> > > >> on the server side on message arrival time alone.
>> > Introduce
>> > > a
>> > > >>>> new
>> > > >>>> >> > > >> ReplicaFetchRequest/Response pair. ReplicaFetchResponses
>> > > will
>> > > >>>> also
>> > > >>>> >> > > >> include the slice of the time-based index for the
>> follower
>> > > >>>> broker.
>> > > >>>> >> > > >> This way we can at least keep timestamps aligned across
>> > > >>>> brokers for
>> > > >>>> >> > > >> retention purposes. We do lose the append timestamp for
>> > > >>>> mirroring
>> > > >>>> >> > > >> pipelines (which appears to be the case in KIP-32 as
>> > well).
>> > > >>>> >> > > >>
>> > > >>>> >> > > >> Configurable index granularity: We can do this but I’m
>> not
>> > > >>>> sure it
>> > > >>>> >> is
>> > > >>>> >> > > >> very useful and as Jay noted, a major change from the
>> old
>> > > >>>> proposal
>> > > >>>> >> > > >> linked from the KIP is the sparse time-based index which
>> > we
>> > > >>>> felt was
>> > > >>>> >> > > >> essential to bound memory usage (and having timestamps
>> on
>> > > >>>> each log
>> > > >>>> >> > > >> index entry was probably a big waste since in the common
>> > > case
>> > > >>>> >> several
>> > > >>>> >> > > >> messages span the same timestamp). BTW another benefit
>> of
>> > > the
>> > > >>>> second
>> > > >>>> >> > > >> index is that it makes it easier to roll-back or throw
>> > away
>> > > if
>> > > >>>> >> > > >> necessary (vs. modifying the existing index format) -
>> > > >>>> although that
>> > > >>>> >> > > >> obviously does not help with rolling back the timestamp
>> > > >>>> change in
>> > > >>>> >> the
>> > > >>>> >> > > >> message format, but it is one less thing to worry about.
>> > > >>>> >> > > >>
>> > > >>>> >> > > >> Versioning: I’m not sure everyone is saying the same
>> thing
>> > > >>>> wrt the
>> > > >>>> >> > > >> scope of this. There is the record format change, but I
>> > also
>> > > >>>> think
>> > > >>>> >> > > >> this ties into all of the API versioning that we already
>> > > have
>> > > >>>> in
>> > > >>>> >> > > >> Kafka. The current API versioning approach works fine
>> for
>> > > >>>> >> > > >> upgrades/downgrades across official Kafka releases, but
>> > not
>> > > >>>> so well
>> > > >>>> >> > > >> between releases. (We almost got bitten by this at
>> > LinkedIn
>> > > >>>> with the
>> > > >>>> >> > > >> recent changes to various requests but were able to work
>> > > >>>> around
>> > > >>>> >> > > >> these.) We can clarify this in the follow-up KIP.
>> > > >>>> >> > > >>
>> > > >>>> >> > > >> Thanks,
>> > > >>>> >> > > >>
>> > > >>>> >> > > >> Joel
>> > > >>>> >> > > >>
>> > > >>>> >> > > >>
>> > > >>>> >> > > >> On Thu, Sep 10, 2015 at 3:00 PM, Jiangjie Qin
>> > > >>>> >> > <j...@linkedin.com.invalid
>> > > >>>> >> > > >
>> > > >>>> >> > > >> wrote:
>> > > >>>> >> > > >> > Hi Jay,
>> > > >>>> >> > > >> >
>> > > >>>> >> > > >> > I just changed the KIP title and updated the KIP page.
>> > > >>>> >> > > >> >
>> > > >>>> >> > > >> > And yes, we are working on a general version control
>> > > >>>> proposal to
>> > > >>>> >> > make
>> > > >>>> >> > > the
>> > > >>>> >> > > >> > protocol migration like this more smooth. I will also
>> > > >>>> create a KIP
>> > > >>>> >> > for
>> > > >>>> >> > > >> that
>> > > >>>> >> > > >> > soon.
>> > > >>>> >> > > >> >
>> > > >>>> >> > > >> > Thanks,
>> > > >>>> >> > > >> >
>> > > >>>> >> > > >> > Jiangjie (Becket) Qin
>> > > >>>> >> > > >> >
>> > > >>>> >> > > >> >
>> > > >>>> >> > > >> > On Thu, Sep 10, 2015 at 2:21 PM, Jay Kreps <
>> > > >>>> j...@confluent.io>
>> > > >>>> >> > wrote:
>> > > >>>> >> > > >> >
>> > > >>>> >> > > >> >> Great, can we change the name to something related to
>> > the
>> > > >>>> >> > > >> change--"KIP-31:
>> > > >>>> >> > > >> >> Move to relative offsets in compressed message sets".
>> > > >>>> >> > > >> >>
>> > > >>>> >> > > >> >> Also you had mentioned before you were going to
>> expand
>> > on
>> > > >>>> the
>> > > >>>> >> > > mechanics
>> > > >>>> >> > > >> of
>> > > >>>> >> > > >> >> handling these log format changes, right?
>> > > >>>> >> > > >> >>
>> > > >>>> >> > > >> >> -Jay
>> > > >>>> >> > > >> >>
>> > > >>>> >> > > >> >> On Thu, Sep 10, 2015 at 12:42 PM, Jiangjie Qin
>> > > >>>> >> > > >> <j...@linkedin.com.invalid>
>> > > >>>> >> > > >> >> wrote:
>> > > >>>> >> > > >> >>
>> > > >>>> >> > > >> >> > Neha and Jay,
>> > > >>>> >> > > >> >> >
>> > > >>>> >> > > >> >> > Thanks a lot for the feedback. Good point about
>> > > >>>> splitting the
>> > > >>>> >> > > >> >> discussion. I
>> > > >>>> >> > > >> >> > have split the proposal to three KIPs and it does
>> > make
>> > > >>>> each
>> > > >>>> >> > > discussion
>> > > >>>> >> > > >> >> more
>> > > >>>> >> > > >> >> > clear:
>> > > >>>> >> > > >> >> > KIP-31 - Message format change (Use relative
>> offset)
>> > > >>>> >> > > >> >> > KIP-32 - Add CreateTime and LogAppendTime to Kafka
>> > > >>>> message
>> > > >>>> >> > > >> >> > KIP-33 - Build a time-based log index
>> > > >>>> >> > > >> >> >
>> > > >>>> >> > > >> >> > KIP-33 can be a follow up KIP for KIP-32, so we can
>> > > >>>> discuss
>> > > >>>> >> about
>> > > >>>> >> > > >> KIP-31
>> > > >>>> >> > > >> >> > and KIP-32 first for now. I will create a separate
>> > > >>>> discussion
>> > > >>>> >> > > thread
>> > > >>>> >> > > >> for
>> > > >>>> >> > > >> >> > KIP-32 and reply the concerns you raised regarding
>> > the
>> > > >>>> >> timestamp.
>> > > >>>> >> > > >> >> >
>> > > >>>> >> > > >> >> > So far it looks there is no objection to KIP-31.
>> > Since
>> > > I
>> > > >>>> >> removed
>> > > >>>> >> > a
>> > > >>>> >> > > few
>> > > >>>> >> > > >> >> part
>> > > >>>> >> > > >> >> > from previous KIP and only left the relative offset
>> > > >>>> proposal,
>> > > >>>> >> it
>> > > >>>> >> > > >> would be
>> > > >>>> >> > > >> >> > great if people can take another look to see if
>> there
>> > > is
>> > > >>>> any
>> > > >>>> >> > > concerns.
>> > > >>>> >> > > >> >> >
>> > > >>>> >> > > >> >> > Thanks,
>> > > >>>> >> > > >> >> >
>> > > >>>> >> > > >> >> > Jiangjie (Becket) Qin
>> > > >>>> >> > > >> >> >
>> > > >>>> >> > > >> >> >
>> > > >>>> >> > > >> >> > On Tue, Sep 8, 2015 at 1:28 PM, Neha Narkhede <
>> > > >>>> >> n...@confluent.io
>> > > >>>> >> > >
>> > > >>>> >> > > >> wrote:
>> > > >>>> >> > > >> >> >
>> > > >>>> >> > > >> >> > > Becket,
>> > > >>>> >> > > >> >> > >
>> > > >>>> >> > > >> >> > > Nice write-up. Few thoughts -
>> > > >>>> >> > > >> >> > >
>> > > >>>> >> > > >> >> > > I'd split up the discussion for simplicity. Note
>> > that
>> > > >>>> you can
>> > > >>>> >> > > always
>> > > >>>> >> > > >> >> > group
>> > > >>>> >> > > >> >> > > several of these in one patch to reduce the
>> > protocol
>> > > >>>> changes
>> > > >>>> >> > > people
>> > > >>>> >> > > >> >> have
>> > > >>>> >> > > >> >> > to
>> > > >>>> >> > > >> >> > > deal with. This is just a suggestion, but I think
>> > the
>> > > >>>> >> following
>> > > >>>> >> > > split
>> > > >>>> >> > > >> >> > might
>> > > >>>> >> > > >> >> > > make it easier to tackle the changes being
>> > proposed -
>> > > >>>> >> > > >> >> > >
>> > > >>>> >> > > >> >> > >    - Relative offsets
>> > > >>>> >> > > >> >> > >    - Introducing the concept of time
>> > > >>>> >> > > >> >> > >    - Time-based indexing (separate the usage of
>> the
>> > > >>>> timestamp
>> > > >>>> >> > > field
>> > > >>>> >> > > >> >> from
>> > > >>>> >> > > >> >> > >    how/whether we want to include a timestamp in
>> > the
>> > > >>>> message)
>> > > >>>> >> > > >> >> > >
>> > > >>>> >> > > >> >> > > I'm a +1 on relative offsets, we should've done
>> it
>> > > >>>> back when
>> > > >>>> >> we
>> > > >>>> >> > > >> >> > introduced
>> > > >>>> >> > > >> >> > > it. Other than reducing the CPU overhead, this
>> will
>> > > >>>> also
>> > > >>>> >> reduce
>> > > >>>> >> > > the
>> > > >>>> >> > > >> >> > garbage
>> > > >>>> >> > > >> >> > > collection overhead on the brokers.
>> > > >>>> >> > > >> >> > >
>> > > >>>> >> > > >> >> > > On the timestamp field, I generally agree that we
>> > > >>>> should add
>> > > >>>> >> a
>> > > >>>> >> > > >> >> timestamp
>> > > >>>> >> > > >> >> > to
>> > > >>>> >> > > >> >> > > a Kafka message but I'm not quite sold on how
>> this
>> > > KIP
>> > > >>>> >> suggests
>> > > >>>> >> > > the
>> > > >>>> >> > > >> >> > > timestamp be set. Will avoid repeating the
>> > downsides
>> > > >>>> of a
>> > > >>>> >> > broker
>> > > >>>> >> > > >> side
>> > > >>>> >> > > >> >> > > timestamp mentioned previously in this thread. I
>> > > think
>> > > >>>> the
>> > > >>>> >> > topic
>> > > >>>> >> > > of
>> > > >>>> >> > > >> >> > > including a timestamp in a Kafka message
>> requires a
>> > > >>>> lot more
>> > > >>>> >> > > thought
>> > > >>>> >> > > >> >> and
>> > > >>>> >> > > >> >> > > details than what's in this KIP. I'd suggest we
>> > make
>> > > >>>> it a
>> > > >>>> >> > > separate
>> > > >>>> >> > > >> KIP
>> > > >>>> >> > > >> >> > that
>> > > >>>> >> > > >> >> > > includes a list of all the different use cases
>> for
>> > > the
>> > > >>>> >> > timestamp
>> > > >>>> >> > > >> >> (beyond
>> > > >>>> >> > > >> >> > > log retention) including stream processing and
>> > > discuss
>> > > >>>> >> > tradeoffs
>> > > >>>> >> > > of
>> > > >>>> >> > > >> >> > > including client and broker side timestamps.
>> > > >>>> >> > > >> >> > >
>> > > >>>> >> > > >> >> > > Agree with the benefit of time-based indexing,
>> but
>> > > >>>> haven't
>> > > >>>> >> had
>> > > >>>> >> > a
>> > > >>>> >> > > >> chance
>> > > >>>> >> > > >> >> > to
>> > > >>>> >> > > >> >> > > dive into the design details yet.
>> > > >>>> >> > > >> >> > >
>> > > >>>> >> > > >> >> > > Thanks,
>> > > >>>> >> > > >> >> > > Neha
>> > > >>>> >> > > >> >> > >
>> > > >>>> >> > > >> >> > > On Tue, Sep 8, 2015 at 10:57 AM, Jay Kreps <
>> > > >>>> j...@confluent.io
>> > > >>>> >> >
>> > > >>>> >> > > >> wrote:
>> > > >>>> >> > > >> >> > >
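The relative-offset idea under discussion can be sketched in isolation (illustrative names only, not the actual broker code): the wrapper of a compressed set keeps the absolute offset of its last inner message, and each inner message keeps only its position within the set, so the broker can assign offsets without decompressing and rewriting the set.

```java
// Illustrative sketch of relative offsets, not actual Kafka code.
// The wrapper stores the absolute offset of the LAST inner message
// (matching the least-upper-bound semantics discussed in this thread);
// inner messages store only their index 0..count-1 within the set.
public class RelativeOffsets {
    // Absolute offset of inner message i, given the wrapper's offset
    // (the offset of the last inner message) and the message count.
    static long absoluteOffset(long wrapperOffset, int count, int i) {
        long base = wrapperOffset - (count - 1); // first inner message
        return base + i;
    }

    public static void main(String[] args) {
        // A wrapper offset of 48 over 4 messages covers offsets 45..48.
        for (int i = 0; i < 4; i++)
            System.out.println(absoluteOffset(48, 4, i));
    }
}
```

With this layout, re-assigning offsets on the broker only touches the wrapper, which is the CPU and GC saving Neha mentions.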
>>>> Hey Beckett,
>>>>
>>>> I was proposing splitting up the KIP just for simplicity of
>>>> discussion. You can still implement them in one patch. I think
>>>> otherwise it will be hard to discuss/vote on them, since if you
>>>> like the offset proposal but not the time proposal, what do you do?
>>>>
>>>> Introducing a second notion of time into Kafka is a pretty massive
>>>> philosophical change, so it kind of warrants its own KIP; I think
>>>> it isn't just "Change message format".
>>>>
>>>> WRT time, I think one thing to clarify in the proposal is how MM
>>>> will have access to set the timestamp. Presumably this will be a
>>>> new field in ProducerRecord, right? If so, then any user can set
>>>> the timestamp, right? I'm not sure you answered the questions
>>>> around how this will work for MM, since when MM retains timestamps
>>>> from multiple partitions they will then be out of order and in the
>>>> past (so the max(lastAppendedTimestamp, currentTimeMillis) override
>>>> you proposed will not work, right?). If we don't do this, then when
>>>> you set up mirroring the data will all be new and you have the same
>>>> retention problem you described. Maybe I missed something...?
>>>>
>>>> My main motivation is that given that both Samza and Kafka streams
>>>> are doing work that implies a mandatory client-defined notion of
>>>> time, I really think introducing a different mandatory notion of
>>>> time in Kafka is going to be quite odd. We should think hard about
>>>> how client-defined time could work. I'm not sure if it can, but I'm
>>>> also not sure that it can't. Having both will be odd. Did you chat
>>>> about this with Yi/Kartik on the Samza side?
>>>>
>>>> When you say it won't work, are you assuming some particular
>>>> implementation? Maybe that the index is a monotonically increasing
>>>> set of pointers to the least record with a timestamp larger than
>>>> the index time? In other words, a search for time X gives the
>>>> largest offset at which all records are <= X?
>>>>
>>>> For retention, I agree with the problem you point out, but I think
>>>> what you are saying in that case is that you want a size limit too.
>>>> If you use system time you actually hit the same problem: say you
>>>> do a full dump of a DB table with a setting of 7 days retention;
>>>> your retention will actually not get enforced for the first 7 days
>>>> because the data is "new to Kafka".
>>>>
>>>> -Jay
>>>>
>>>> On Mon, Sep 7, 2015 at 10:44 AM, Jiangjie Qin
>>>> <j...@linkedin.com.invalid> wrote:
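The index semantics Jay sketches above can be written down concretely (a toy model, not Kafka's actual time index): entries are only appended with strictly increasing timestamps, and a lookup for time X returns the largest indexed offset whose timestamp is <= X.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of the time-index semantics described in the thread:
// a monotonically increasing sequence of (timestamp -> offset)
// pointers, where lookup(X) yields the largest indexed offset whose
// timestamp is <= X. Names here are illustrative only.
public class TimeIndexSketch {
    private final TreeMap<Long, Long> index = new TreeMap<>();

    // Append only when the timestamp advances; otherwise the index
    // would not support an efficient ordered search.
    public void maybeAppend(long timestamp, long offset) {
        if (index.isEmpty() || timestamp > index.lastKey())
            index.put(timestamp, offset);
    }

    // Largest indexed offset whose timestamp is <= targetTime, or -1
    // if every indexed entry is newer than targetTime.
    public long lookup(long targetTime) {
        Map.Entry<Long, Long> e = index.floorEntry(targetTime);
        return e == null ? -1L : e.getValue();
    }
}
```

Note this only answers queries correctly if timestamps really are non-decreasing in offset order, which is the crux of the client- vs. broker-side timestamp debate below.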
>>>>> Jay,
>>>>>
>>>>> Thanks for the comments. Yes, there are actually three proposals
>>>>> as you pointed out.
>>>>>
>>>>> We will have a separate proposal for (1) - the version control
>>>>> mechanism. We actually thought about whether we want to separate 2
>>>>> and 3 internally before creating the KIP. The reason we put 2 and
>>>>> 3 together is that it saves us another cross-board wire protocol
>>>>> change. Like you said, we have to migrate all the clients in all
>>>>> languages. To some extent, the effort spent on upgrading the
>>>>> clients can be even bigger than implementing the new feature
>>>>> itself. So there is some attraction to doing 2 and 3 together
>>>>> instead of separately. Maybe after (1) is done it will be easier
>>>>> to do protocol migration, but if we are able to come to an
>>>>> agreement on the timestamp solution, I would prefer to have it
>>>>> together with relative offsets in the interest of avoiding another
>>>>> wire protocol change (the process to migrate to relative offsets
>>>>> is exactly the same as migrating to messages with a timestamp).
>>>>>
>>>>> In terms of the timestamp, I completely agree that having a client
>>>>> timestamp is more useful if we can make sure the timestamp is
>>>>> good. But in reality that can be a really big *IF*. I think the
>>>>> problem is exactly as Ewen mentioned: if we let the client set the
>>>>> timestamp, it would be very hard for the broker to utilize it. If
>>>>> the broker applies its retention policy based on the client
>>>>> timestamp, one misbehaving producer can potentially completely
>>>>> mess up the retention policy on the broker. Although people don't
>>>>> care about the server-side timestamp, people do care a lot when
>>>>> timestamps break. Searching by timestamp is a really important use
>>>>> case even though it is not used as often as searching by offset.
>>>>> It has a significant direct impact on RTO when there is a
>>>>> cross-cluster failover, as Todd mentioned.
>>>>>
>>>>> The trick of using max(lastAppendedTimestamp, currentTimeMillis)
>>>>> is to guarantee monotonic increase of the timestamp. Many
>>>>> commercial systems actually do something similar to this to solve
>>>>> time skew. About changing the time, I am not sure people use NTP
>>>>> like using a watch, just setting it forward/backward by an hour or
>>>>> so. The time adjustment I used to do is typically something like a
>>>>> minute per week. So for each second, the clock might be a few
>>>>> microseconds slower/faster, but it should not break completely, so
>>>>> that all time-based transactions are unaffected. The one-minute
>>>>> change will be done within a week, not instantly.
>>>>>
>>>>> Personally, I think having a client-side timestamp will be useful
>>>>> if we don't need to put the broker and data integrity at risk. If
>>>>> we have to choose one of them but not both, I would prefer the
>>>>> server-side timestamp, because for the client-side timestamp there
>>>>> is always a plan B, which is putting the timestamp into the
>>>>> payload.
>>>>>
>>>>> Another reason I am reluctant to use the client-side timestamp is
>>>>> that it is always dangerous to mix the control plane with the data
>>>>> plane. IP did this, and it has caused so many breaches that people
>>>>> are migrating to something like MPLS. An example in Kafka is that
>>>>> any client can construct a LeaderAndIsrRequest /
>>>>> UpdateMetadataRequest / ControlledShutdownRequest (you name it)
>>>>> and send it to the broker to mess up the entire cluster; also, as
>>>>> we already noticed, a busy cluster can respond quite slowly to
>>>>> controller messages. So it would really be nice if we can avoid
>>>>> giving clients the power to control log retention.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jiangjie (Becket) Qin
>>>>>
>>>>> On Sun, Sep 6, 2015 at 9:54 PM, Todd Palino <tpal...@gmail.com>
>>>>> wrote:
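The max(lastAppendedTimestamp, currentTimeMillis) trick Becket refers to reads roughly like this (a sketch; the field and method names are made up, and the clock is a parameter so the behavior is easy to demonstrate):

```java
// Sketch of the monotonic-timestamp trick: the log's timestamp never
// goes backwards, even if the wall clock is stepped backwards by an
// NTP adjustment. Illustrative only, not actual broker code.
public class MonotonicTimestamp {
    private long lastAppendedTimestamp = Long.MIN_VALUE;

    // In a real broker clockMillis would be System.currentTimeMillis().
    public long nextTimestamp(long clockMillis) {
        lastAppendedTimestamp = Math.max(lastAppendedTimestamp, clockMillis);
        return lastAppendedTimestamp;
    }
}
```

The cost of the trick is the one Jay points out above: if a mirrored message legitimately carries an older timestamp, the override flattens it to the newer value.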
>>>>>> So, with regards to why you want to search by timestamp, the
>>>>>> biggest problem I've seen is with consumers who want to reset
>>>>>> their timestamps to a specific point, whether it is to replay a
>>>>>> certain amount of messages, or to rewind to before some problem
>>>>>> state existed. This happens more often than anyone would like.
>>>>>>
>>>>>> To handle this now we need to constantly export the broker's
>>>>>> offset for every partition to a time-series database and then use
>>>>>> external processes to query this. I know we're not the only ones
>>>>>> doing this. The way the broker handles requests for offsets by
>>>>>> timestamp is a little obtuse (explain it to anyone without
>>>>>> intimate knowledge of the internal workings of the broker - every
>>>>>> time I do I see this). In addition, as Becket pointed out, it
>>>>>> causes problems specifically with retention of messages by time
>>>>>> when you move partitions around.
>>>>>>
>>>>>> I'm deliberately avoiding the discussion of what timestamp to
>>>>>> use. I can see the argument either way, though I tend to lean
>>>>>> towards the idea that the broker timestamp is the only viable
>>>>>> source of truth in this situation.
>>>>>>
>>>>>> -Todd
>>>>>>
>>>>>> On Sun, Sep 6, 2015 at 7:08 PM, Ewen Cheslack-Postava
>>>>>> <e...@confluent.io> wrote:
>>>>>>> On Sun, Sep 6, 2015 at 4:57 PM, Jay Kreps <j...@confluent.io>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> 2. Nobody cares what time it is on the server.
>>>>>>>
>>>>>>> This is a good way of summarizing the issue I was trying to get
>>>>>>> at, from an app's perspective. Of the 3 stated goals of the KIP,
>>>>>>> #2 (log retention) is reasonably handled by a server-side
>>>>>>> timestamp. I really just care that a message is there long
>>>>>>> enough that I have a chance to process it. #3 (searching by
>>>>>>> timestamp) only seems useful if we can guarantee the server-side
>>>>>>> timestamp is close enough to the original client-side timestamp,
>>>>>>> and any mirror maker step seems to break that (even ignoring any
>>>>>>> issues with broker availability).
>>>>>>>
>>>>>>> I'm also wondering whether optimizing for search-by-timestamp on
>>>>>>> the broker is really something we want to do, given that
>>>>>>> messages aren't really guaranteed to be ordered by
>>>>>>> application-level timestamps on the broker. Is part of the need
>>>>>>> for this just due to the current consumer APIs being difficult
>>>>>>> to work with? For example, could you implement this pretty
>>>>>>> easily client side just the way you would broker-side? I'd
>>>>>>> imagine a couple of random seeks + reads during very rare
>>>>>>> occasions (i.e. when the app starts up) wouldn't be a problem
>>>>>>> performance-wise. Or is it also that you need the broker to
>>>>>>> enforce things like monotonically increasing timestamps, since
>>>>>>> you can't do the query properly and efficiently without that
>>>>>>> guarantee, and therefore what applications are actually looking
>>>>>>> for *is* broker-side timestamps?
>>>>>>>
>>>>>>> -Ewen
>>>>>>>
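Ewen's client-side alternative amounts to a binary search over offsets, fetching one record per probe to read its timestamp. A sketch, where fetchTimestamp stands in for "fetch the record at this offset and read its timestamp" (an assumed helper, not a real consumer API), and which is only valid if timestamps are non-decreasing in offset order - exactly the guarantee in question:

```java
import java.util.function.LongUnaryOperator;

// Sketch of client-side search-by-timestamp: O(log n) random fetches
// instead of a broker-side index. Assumes timestamps never decrease
// as offsets increase; without that guarantee the search is undefined.
public class ClientSideTimeSearch {
    // Smallest offset in [begin, end) whose timestamp is >= target;
    // returns end if every record in range is older than target.
    public static long earliestOffsetAtOrAfter(
            long begin, long end, long target,
            LongUnaryOperator fetchTimestamp) {
        long lo = begin, hi = end;
        while (lo < hi) {
            long mid = lo + (hi - lo) / 2;
            if (fetchTimestamp.applyAsLong(mid) < target) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }
}
```

Each iteration costs one random fetch, so seeking into a partition of a billion records takes about 30 fetches - cheap if done only at app startup, as Ewen suggests.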
>>>>>>>> Consider cases where data is being copied from a database or
>>>>>>>> from log files. In steady state the server time is very close
>>>>>>>> to the client time if their clocks are sync'd (see 1), but
>>>>>>>> there will be times of large divergence when the copying
>>>>>>>> process is stopped or falls behind. When this occurs it is
>>>>>>>> clear that the time the data arrived on the server is
>>>>>>>> irrelevant; it is the source timestamp that matters. This is
>>>>>>>> the problem you are trying to fix by retaining the MM
>>>>>>>> timestamp, but really the client should always set the time,
>>>>>>>> with server-side time as a fallback. It would be worth talking
>>>>>>>> to the Samza folks and reading through this blog post
>>>>>>>> (http://radar.oreilly.com/2015/08/the-world-beyond-batch-streaming-101.html)
>>>>>>>> on this subject, since we went through similar learnings on the
>>>>>>>> stream processing side.
>>>>>>>>
>>>>>>>> I think the implication of these two is that we need a proposal
>>>>>>>> that handles potentially very out-of-order timestamps in some
>>>>>>>> kind of sane-ish way (buggy clients will set something totally
>>>>>>>> wrong as the time).
>>>>>>>>
>>>>>>>> -Jay
>>>>>>>>
>>>>>>>> On Sun, Sep 6, 2015 at 4:22 PM, Jay Kreps <j...@confluent.io>
>>>>>>>> wrote:
>>>>>>>>> The magic byte is used to version the message format, so we'll
>>>>>>>>> need to make sure that check is in place -- I actually don't
>>>>>>>>> see it in the current consumer code, which I think is a bug we
>>>>>>>>> should fix for the next release (filed KAFKA-2523). The
>>>>>>>>> purpose of that field is so there is a clear check on the
>>>>>>>>> format rather than the scrambled scenarios Becket describes.
>>>>>>>>>
>>>>>>>>> Also, Becket, I don't think just fixing the Java client is
>>>>>>>>> sufficient, as that would break other clients -- i.e. if
>>>>>>>>> anyone writes v1 messages, even by accident, any
>>>>>>>>> non-v1-capable consumer will break. I think we probably need a
>>>>>>>>> way to have the server ensure a particular message format
>>>>>>>>> either at read or write time.
>>>>>>>>>
>>>>>>>>> -Jay
>>>>>>>>>
>>>>>>>>> On Thu, Sep 3, 2015 at 3:47 PM, Jiangjie Qin
>>>>>>>>> <j...@linkedin.com.invalid> wrote:
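The kind of up-front magic-byte check Jay is asking for could look like this (a sketch only; the real on-disk message layout places other fields such as the CRC around the magic byte, and the constant name is invented):

```java
import java.nio.ByteBuffer;

// Sketch of an explicit format-version check: reject unknown magic
// values before attempting to parse the bytes that follow, instead of
// misreading them as other fields. Layout simplified: the magic byte
// is assumed to be the first byte of the buffer.
public class MagicByteCheck {
    static final byte HIGHEST_SUPPORTED_MAGIC = 0;

    public static void checkMagic(ByteBuffer message) {
        byte magic = message.get(message.position());
        if (magic > HIGHEST_SUPPORTED_MAGIC)
            throw new IllegalStateException(
                "Unsupported message format version: " + magic);
    }
}
```

Failing fast here turns the "scrambled scenarios" into a clear, actionable error at the format boundary.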
>> > > >>>> >> > > >> >> > > > > > > > >> Hi Guozhang,
>> > > >>>> >> > > >> >> > > > > > > > >>
>> > > >>>> >> > > >> >> > > > > > > > >> I checked the code again. Actually
>> CRC
>> > > >>>> check
>> > > >>>> >> > > probably
>> > > >>>> >> > > >> >> won't
>> > > >>>> >> > > >> >> > > > fail.
>> > > >>>> >> > > >> >> > > > > > The
>> > > >>>> >> > > >> >> > > > > > > > >> newly
>> > > >>>> >> > > >> >> > > > > > > > >> added timestamp field might be
>> treated
>> > > as
>> > > >>>> >> > keyLength
>> > > >>>> >> > > >> >> instead,
>> > > >>>> >> > > >> >> > > so
>> > > >>>> >> > > >> >> > > > we
>> > > >>>> >> > > >> >> > > > > > are
>> > > >>>> >> > > >> >> > > > > > > > >> likely to receive an
>> > > >>>> IllegalArgumentException
>> > > >>>> >> when
>> > > >>>> >> > > try
>> > > >>>> >> > > >> to
>> > > >>>> >> > > >> >> > read
>> > > >>>> >> > > >> >> > > > the
>> > > >>>> >> > > >> >> > > > > > > key.
>> > > >>>> >> > > >> >> > > > > > > > >> I'll update the KIP.
>> > > >>>> >> > > >> >> > > > > > > > >>
>> > > >>>> >> > > >> >> > > > > > > > >> Thanks,
>> > > >>>> >> > > >> >> > > > > > > > >>
>> > > >>>> >> > > >> >> > > > > > > > >> Jiangjie (Becket) Qin
>> > > >>>> >> > > >> >> > > > > > > > >>
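The failure mode described here (an old-format reader treating the first bytes of the new timestamp field as a key length) can be sketched as follows. This is a hypothetical illustration, not actual Kafka code: the helper names are invented, the layout is simplified to the fields after the CRC (magic, attributes, optional timestamp, key length, key, value length, value), and the CRC/offset/size prefixes are omitted. Note the CRC itself would still match, since it is computed over these same bytes; the confusing error only appears when the fields are interpreted.

```python
import struct

# Simplified body layout after the CRC field (sketch, not Kafka source):
#   v0: magic(1) attributes(1) keyLen(4) key valueLen(4) value
#   v1: magic(1) attributes(1) timestamp(8) keyLen(4) key valueLen(4) value

def build_v1_body(timestamp_ms: int, key: bytes, value: bytes) -> bytes:
    body = struct.pack(">bbq", 1, 0, timestamp_ms)  # magic=1, attrs=0, timestamp
    body += struct.pack(">i", len(key)) + key
    body += struct.pack(">i", len(value)) + value
    return body

def read_key_as_v0(body: bytes) -> bytes:
    # A v0 reader skips magic and attributes, then expects keyLen.
    # On a v1 body it actually reads the high 4 bytes of the timestamp.
    (key_len,) = struct.unpack_from(">i", body, 2)
    if key_len < -1 or 6 + key_len > len(body):
        raise ValueError(f"invalid key length {key_len}")
    return body[6:6 + key_len]

body = build_v1_body(1442851200000, b"k", b"v")
try:
    read_key_as_v0(body)
except ValueError as e:
    print("v0 reader rejects v1 body:", e)
# prints: v0 reader rejects v1 body: invalid key length 335
```

With this timestamp, the top 32 bits come out as 335, an impossible key length for a 20-byte body, so the old-style reader fails on the key-length field rather than on the checksum.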
>>>>> On Thu, Sep 3, 2015 at 12:48 PM, Jiangjie Qin <j...@linkedin.com>
>>>>> wrote:
>>>>>
>>>>>> Hi, Guozhang,
>>>>>>
>>>>>> Thanks for reading the KIP. By "old consumer", I meant the
>>>>>> ZookeeperConsumerConnector in trunk now, i.e. without this bug fixed.
>>>>>> If we fix the ZookeeperConsumerConnector then it will throw an
>>>>>> exception complaining about the unsupported version when it sees
>>>>>> message format V1. What I was trying to say is that if we have some
>>>>>> ZookeeperConsumerConnector running without the fix, the consumer will
>>>>>> complain about a CRC mismatch instead of an unsupported version.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jiangjie (Becket) Qin
>>>>>>
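The fix alluded to here amounts to validating the magic byte up front, so an un-upgraded-format message fails with a clear "unsupported version" error rather than a misleading CRC or key-length failure. A minimal sketch; this is not the actual ZookeeperConsumerConnector code, and `SUPPORTED_MAGIC` and the function name are invented for illustration:

```python
# Magic values this (old) consumer knows how to parse.
SUPPORTED_MAGIC = {0}

def ensure_supported(body: bytes) -> None:
    # The magic byte is the first field after the CRC in both formats,
    # so it can be checked before any version-specific parsing.
    magic = body[0]
    if magic not in SUPPORTED_MAGIC:
        raise ValueError(f"unknown magic byte {magic} in message")

v1_header = bytes([1, 0])  # magic=1 (new format), attributes=0
try:
    ensure_supported(v1_header)
except ValueError as e:
    print(e)  # prints: unknown magic byte 1 in message
```

Because the magic byte sits at the same position in every format version, this check is safe to add to old readers and gives them a forward-compatible failure mode.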
>>>>>> On Thu, Sep 3, 2015 at 12:15 PM, Guozhang Wang <wangg...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the write-up Jiangjie.
>>>>>>>
>>>>>>> One comment about the migration plan: "For old consumers, if they
>>>>>>> see the new protocol the CRC check will fail"...
>>>>>>>
>>>>>>> Do you mean this bug in the old consumer cannot be fixed in a
>>>>>>> backward-compatible way?
>>>>>>>
>>>>>>> Guozhang
>>>>>>>
>>>>>>> On Thu, Sep 3, 2015 at 8:35 AM, Jiangjie Qin <j...@linkedin.com.invalid>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We just created KIP-31 to propose a message format change in Kafka.
>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-31+-+Message+format+change+proposal
>>>>>>>>
>>>>>>>> As a summary, the motivations are:
>>>>>>>> 1. Avoid server side message re-compression
>>>>>>>> 2. Honor time-based log roll and retention
>>>>>>>> 3. Enable offset search by timestamp at a finer granularity.
>>>>>>>>
>>>>>>>> Feedback and comments are welcome!
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>
>>>>>>> --
>>>>>>> -- Guozhang
>>>>>>>
>>> --
>>> Thanks,
>>> Ewen
>>>
>> --
>> Thanks,
>> Neha
>>
> --
> Thanks,
> Ewen
>