Re: [DISCUSS] KIP-258: Allow to Store Record Timestamps in RocksDB

Matthias J. Sax Thu, 09 Aug 2018 14:31:45 -0700

Thanks for your input Guozhang and John.

I see your point, that the upgrade API is not simple. If you don't
thinks it's valuable to make generic store upgrades possible (atm), we
can make the API internal, too. The impact is, that we only support a
predefined set up upgrades (ie, KV to KVwithTs, Windowed to
WindowedWithTS etc) for which we implement the internal interfaces.


We can keep the design generic, so if we decide to make it public, we
don't need to re-invent it. This will also have the advantage, that we
can add upgrade pattern for other stores later, too.

I also agree, that the `StoreUpgradeBuilder` is a little ugly, but it
was the only way I could find to design a generic upgrade interface. If
we decide the hide all the upgrade stuff, `StoreUpgradeBuilder` would
become an internal interface I guess (don't think we can remove it).

I will wait for more feedback about this and if nobody wants to keep it
as public API I will update the KIP accordingly. Will add some more
clarifications for different upgrade patterns in the mean time and fix
the typos/minor issues.

About adding a new state UPGRADING: maybe we could do that. However, I
find it particularly difficult to make the estimation when we should
switch to RUNNING, thus, I am a little hesitant. Using store callbacks
or just logging the progress including some indication about the "lag"
might actually be sufficient. Not sure what others think?

About "value before timestamp": no real reason and I think it does not
make any difference. Do you want to change it?

About upgrade robustness: yes, we cannot control if an instance fails.
That is what I meant by "we need to write test". The upgrade should be
able to continuous even is an instance goes down (and we must make sure
that we don't end up in an invalid state that forces us to wipe out the
whole store). Thus, we need to write system tests that fail instances
during upgrade.

For `in_place_offline` upgrade: I don't think we need this mode, because
people can do this via a single rolling bounce.

 - prepare code and switch KV-Store to KVwithTs-Store
 - do a single rolling bounce (don't set any upgrade config)

For this case, the `StoreUpgradeBuilder` (or `KVwithTs-Store` if we
remove the `StoreUpgradeBuilder`) will detect that there is only an old
local KV store w/o TS, will start to restore the new KVwithTs store,
wipe out the old store and replace with the new store after restore is
finished, and start processing only afterwards. (I guess we need to
document this case -- will also add it to the KIP.)



-Matthias



On 8/9/18 1:10 PM, John Roesler wrote:
> Hi Matthias,
> 
> I think this KIP is looking really good.
> 
> I have a few thoughts to add to the others:
> 
> 1. You mentioned at one point users needing to configure
> `upgrade.mode="null"`. I think this was a typo and you meant to say they
> should remove the config. If they really have to set it to a string "null"
> or even set it to a null value but not remove it, it would be unfortunate.
> 
> 2. In response to Bill's comment #1 , you said that "The idea is that the
> upgrade should be robust and not fail. We need to write according tests".
> I may have misunderstood the conversation, but I don't think it's within
> our power to say that an instance won't fail. What if one of my computers
> catches on fire? What if I'm deployed in the cloud and one instance
> disappears and is replaced by a new one? Or what if one instance goes AWOL
> for a long time and then suddenly returns? How will the upgrade process
> behave in light of such failures?
> 
> 3. your thought about making in-place an offline mode is interesting, but
> it might be a bummer for on-prem users who wish to upgrade online, but
> cannot just add new machines to the pool. It could be a new upgrade mode
> "offline-in-place", though...
> 
> 4. I was surprised to see that a user would need to modify the topology to
> do an upgrade (using StoreUpgradeBuilder). Maybe some of Guozhang's
> suggestions would remove this necessity.
> 
> Thanks for taking on this very complex but necessary work.
> 
> -John
> 
> On Thu, Aug 9, 2018 at 12:22 PM Guozhang Wang <[email protected]> wrote:
> 
>> Hello Matthias,
>>
>> Thanks for the updated KIP. Some more comments:
>>
>> 1. The current set of proposed API is a bit too complicated, which makes
>> the upgrade flow from user's perspective also a bit complex. I'd like to
>> check different APIs and discuss about their needs separately:
>>
>>     1.a. StoreProxy: needed for in-place upgrade only, between the first
>> and second rolling bounce, where the old-versioned stores can handle
>> new-versioned store APIs. I think such upgrade paths (i.e. from one store
>> type to another) would not be very common: users may want to upgrade from a
>> certain store engine to another, but the interface would likely be staying
>> the same. Hence personally I'd suggest we keep it internally and only
>> consider exposing it in the future if it does become a common pattern.
>>
>>     1.b. ConverterStore / RecordConverter: needed for both in-place and
>> roll-over upgrade, between the first and second rolling bounces, for the
>> new versioned store to be able to read old-versioned changelog topics.
>> Firstly I think we should not expose key in the public APIs but only the
>> values, since allowing key format changes would break log compaction, and
>> hence would not be compatible anyways. As for value format changes,
>> personally I think we can also keep its upgrade logic internally as it may
>> not worth generalizing to user customizable logic.
>>
>>     1.c. If you agrees with 2.a/b above, then we can also remove "
>> keyValueToKeyValueWithTimestampUpgradeStoreBuilder" from the public APIs.
>>
>>     1.d. Personally I think "ReadOnlyKeyValueWithTimestampStore" is not
>> needed either given that we are exposing "ValueAndTimestamp" anyways. I.e.
>> it is just a syntax sugar and for IQ, users can always just set a "
>> QueryableStoreType<ReadOnlyKeyValue<K, ValueAndTimestamp<V>>>" as the new
>> interface does not provide any additional functions.
>>
>>
>> 2. Could we further categorize the upgrade flow for different use cases,
>> e.g. 1) DSL users where KeyValueWithTimestampStore will be used
>> automatically for non-windowed aggregate; 2) PAPI users who do not need to
>> use KeyValueWithTimestampStore; 3) PAPI users who do want to switch to
>> KeyValueWithTimestampStore. Just to give my understanding for 3), the
>> upgrade flow for users may be simplified as the following (for both
>> in-place and roll-over):
>>
>>     * Update the jar to new version, make code changes from KeyValueStore
>> to KeyValueWithTimestampStore, set upgrade config.
>>
>>     * First rolling bounce, and library code can internally use proxy /
>> converter based on the specified config to handle new APIs with old stores,
>> while let new stores read from old changelog data.
>>
>>     * Reset upgrade config.
>>
>>     * Second rolling bounce, and the library code automatically turn off
>> logic for proxy / converter.
>>
>>
>> 3. Some more detailed proposals are needed for when to recommend users to
>> trigger the second rolling bounce. I have one idea to share here: we add a
>> new state to KafkaStreams, say UPGRADING, which is set when 1) upgrade
>> config is set, and 2) the new stores are still ramping up (for the second
>> part, we can start with some internal hard-coded heuristics to decide when
>> it is close to be ramped up). If either one of it is not true any more, it
>> should transit to RUNNING. Users can then watch on this state, and decide
>> to only trigger the second rebalance when the state has transited from
>> UPGRADING. They can also choose to cut over while the instance is still
>> UPGRADING, the downside is that after that the application may have long
>> restoration phase which is, to user's pov, unavailability periods.
>>
>>
>> Below are just some minor things on the wiki:
>>
>> 4. "proxy story" => "proxy store".
>>
>> 5. "use the a builder " => "use a builder"
>>
>> 6: "we add the record timestamp as a 8-byte (long) prefix to the value":
>> what's the rationale of putting the timestamp before the value, than after
>> the value?
>>
>>
>>
>> Guozhang
>>
>>
>> On Tue, Aug 7, 2018 at 5:13 PM, Matthias J. Sax <[email protected]>
>> wrote:
>>
>>> Thanks for the feedback Bill. I just update the KIP with some of your
>>> points.
>>>
>>>>> Regarding step 3C of the in-place upgrade (users needing to watch the
>>>>> restore process), I'm wondering if we want to provide a type of
>>>>> StateRestoreListener that could signal when the new stores have
>> reached
>>>>> parity with the existing old stores and that could be the signal to
>>> start
>>>>> second rolling rebalance?
>>>
>>> I think we can reuse the existing listeners, thus, I did not include
>>> anything in the KIP. About a signal to rebalance: this might be tricky.
>>> If we prepare the store "online", the active task will update the state
>>> continuously, and thus, state prepare is never finished. It will be the
>>> users responsibility to do the second rebalance (note, that the second
>>> rebalance will first finish the last delta of the upgrade to finish the
>>> upgrade before actual processing resumes). I clarified the KIP with this
>>> regard a little bit.
>>>
>>>>> 1. Out of N instances, one fails midway through the process, would we
>>> allow
>>>>> the other instances to complete or just fail the entire upgrade?
>>>
>>> The idea is that the upgrade should be robust and not fail. We need to
>>> write according tests.
>>>
>>>>> 2. During the second rolling bounce, maybe we could rename the current
>>>>> active directories vs. deleting them right away,  and when all the
>>> prepare
>>>>> task directories are successfully migrated then delete the previous
>>> active
>>>>> ones.
>>>
>>> Ack. Updated the KIP.
>>>
>>>>> 3. For the first rolling bounce we pause any processing any new
>> records
>>> and
>>>>> just allow the prepare tasks to restore, then once all prepare tasks
>>> have
>>>>> restored, it's a signal for the second round of rolling bounces and
>>> then as
>>>>> each task successfully renames its prepare directories and deletes the
>>> old
>>>>> active task directories, normal processing of records resumes.
>>>
>>> The basic idea is to do an online upgrade to avoid downtime. We can
>>> discuss to offer both options... For the offline upgrade option, we
>>> could simplify user interaction and trigger the second rebalance
>>> automatically with the requirement that a user needs to update any
>> config.
>>>
>>> If might actually be worth to include this option: we know from
>>> experience with state restore, that regular processing slows down the
>>> restore. For roll_over upgrade, it would be a different story and
>>> upgrade should not be slowed down by regular processing. Thus, we should
>>> even make in_place an offline upgrade and force people to use roll_over
>>> if they need onlint upgrade. Might be a fair tradeoff that may simplify
>>> the upgrade for the user and for the code complexity.
>>>
>>> Let's see what other think.
>>>
>>>
>>> -Matthias
>>>
>>>
>>> On 7/27/18 12:53 PM, Bill Bejeck wrote:
>>>> Hi Matthias,
>>>>
>>>> Thanks for the update and the working prototype, it helps with
>>>> understanding the KIP.
>>>>
>>>> I took an initial pass over this PR, and overall I find the interfaces
>>> and
>>>> approach to be reasonable.
>>>>
>>>> Regarding step 3C of the in-place upgrade (users needing to watch the
>>>> restore process), I'm wondering if we want to provide a type of
>>>> StateRestoreListener that could signal when the new stores have reached
>>>> parity with the existing old stores and that could be the signal to
>> start
>>>> second rolling rebalance?
>>>>
>>>> Although you solicited feedback on the interfaces involved, I wanted to
>>> put
>>>> down some thoughts that have come to mind reviewing this KIP again
>>>>
>>>> 1. Out of N instances, one fails midway through the process, would we
>>> allow
>>>> the other instances to complete or just fail the entire upgrade?
>>>> 2. During the second rolling bounce, maybe we could rename the current
>>>> active directories vs. deleting them right away,  and when all the
>>> prepare
>>>> task directories are successfully migrated then delete the previous
>>> active
>>>> ones.
>>>> 3. For the first rolling bounce we pause any processing any new records
>>> and
>>>> just allow the prepare tasks to restore, then once all prepare tasks
>> have
>>>> restored, it's a signal for the second round of rolling bounces and
>> then
>>> as
>>>> each task successfully renames its prepare directories and deletes the
>>> old
>>>> active task directories, normal processing of records resumes.
>>>>
>>>> Thanks,
>>>> Bill
>>>>
>>>>
>>>>
>>>> On Wed, Jul 25, 2018 at 9:42 PM Matthias J. Sax <[email protected]
>>>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> KIP-268 (rebalance meatadata) is finished and included in AK 2.0
>>>>> release. Thus, I want to pick up this KIP again to get the RocksDB
>>>>> upgrade done for 2.1.
>>>>>
>>>>> I updated the KIP accordingly and also have a "prove of concept" PR
>>>>> ready (for "in place" upgrade only):
>>>>> https://github.com/apache/kafka/pull/5422/
>>>>>
>>>>> There a still open questions, but I want to collect early feedback on
>>>>> the proposed interfaces we need for the store upgrade. Also note, that
>>>>> the KIP now also aim to define a generic upgrade path from any store
>>>>> format A to any other store format B. Adding timestamps is just a
>>>>> special case.
>>>>>
>>>>> I will continue to work on the PR and refine the KIP in the meantime,
>>> too.
>>>>>
>>>>> Looking forward to your feedback.
>>>>>
>>>>> -Matthias
>>>>>
>>>>>
>>>>> On 3/14/18 11:14 PM, Matthias J. Sax wrote:
>>>>>> After some more thoughts, I want to follow John's suggestion and
>> split
>>>>>> upgrading the rebalance metadata from the store upgrade.
>>>>>>
>>>>>> I extracted the metadata upgrade into it's own KIP:
>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-268%
>>> 3A+Simplify+Kafka+Streams+Rebalance+Metadata+Upgrade
>>>>>>
>>>>>> I'll update this KIP accordingly shortly. I also want to consider to
>>>>>> make the store format upgrade more flexible/generic. Atm, the KIP is
>>> too
>>>>>> much tailored to the DSL IMHO and does not encounter PAPI users that
>> we
>>>>>> should not force to upgrade the stores. I need to figure out the
>>> details
>>>>>> and follow up later.
>>>>>>
>>>>>> Please give feedback for the new KIP-268 on the corresponding
>>> discussion
>>>>>> thread.
>>>>>>
>>>>>> @James: unfortunately, for upgrading to 1.2 I couldn't figure out a
>> way
>>>>>> for a single rolling bounce upgrade. But KIP-268 proposes a fix for
>>>>>> future upgrades. Please share your thoughts.
>>>>>>
>>>>>> Thanks for all your feedback!
>>>>>>
>>>>>> -Matthias
>>>>>>
>>>>>> On 3/12/18 11:56 PM, Matthias J. Sax wrote:
>>>>>>> @John: yes, we would throw if configs are missing (it's an
>>>>>>> implementation details IMHO and thus I did not include it in the
>> KIP)
>>>>>>>
>>>>>>> @Guozhang:
>>>>>>>
>>>>>>> 1) I understand know what you mean. We can certainly, allow all
>> values
>>>>>>> "0.10.0.x", "0.10.1.x", "0.10.2.x", ... "1.1.x" for `upgrade.from`
>>>>>>> parameter. I had a similar though once but decided to collapse them
>>> into
>>>>>>> one -- will update the KIP accordingly.
>>>>>>>
>>>>>>> 2) The idea to avoid any config would be, to always send both
>> request.
>>>>>>> If we add a config to eventually disable the old request, we don't
>>> gain
>>>>>>> anything with this approach. The question is really, if we are
>> willing
>>>>>>> to pay this overhead from 1.2 on -- note, it would be limited to 2
>>>>>>> versions and not grow further in future releases. More details in
>> (3)
>>>>>>>
>>>>>>> 3) Yes, this approach subsumes (2) for later releases and allows us
>> to
>>>>>>> stay with 2 "assignment strategies" we need to register, as the new
>>>>>>> assignment strategy will allow to "upgrade itself" via "version
>>>>>>> probing". Thus, (2) would only be a workaround to avoid a config if
>>>>>>> people upgrade from pre-1.2 releases.
>>>>>>>
>>>>>>> Thus, I don't think we need to register new "assignment strategies"
>>> and
>>>>>>> send empty subscriptions for older version.
>>>>>>>
>>>>>>> 4) I agree that this is a tricky thing to get right with a single
>>>>>>> rebalance. I share the concern that an application might never catch
>>> up
>>>>>>> and thus the hot standby will never be ready.
>>>>>>>
>>>>>>> Maybe it's better to go with 2 rebalances for store upgrades. If we
>> do
>>>>>>> this, we also don't need to go with (2) and can get (3) in place for
>>>>>>> future upgrades. I also think that changes to the metadata are more
>>>>>>> likely and thus allowing for single rolling bounce for this case is
>>> more
>>>>>>> important anyway. If we assume that store upgrade a rare, it might
>> be
>>> ok
>>>>>>> to sacrifice two rolling bounced for this case. It was just an idea
>> I
>>>>>>> wanted to share (even if I see the issues).
>>>>>>>
>>>>>>>
>>>>>>> -Matthias
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 3/12/18 11:45 AM, Guozhang Wang wrote:
>>>>>>>> Hello Matthias, thanks for your replies.
>>>>>>>>
>>>>>>>>
>>>>>>>> 1) About the config names: actually I was trying to not expose
>>>>>>>> implementation details :) My main concern was that in your proposal
>>> the
>>>>>>>> values need to cover the span of all the versions that are actually
>>>>> using
>>>>>>>> the same version, i.e. "0.10.1.x-1.1.x". So if I (as a user) am
>>>>> upgrading
>>>>>>>> from any versions within this range I need to remember to use the
>>> value
>>>>>>>> "0.10.1.x-1.1.x" than just specifying my old version. In my
>>> suggestion
>>>>> I
>>>>>>>> was trying to argue the benefit of just letting users to specify
>> the
>>>>> actual
>>>>>>>> Kafka version she's trying to upgrade from, than specifying a range
>>> of
>>>>>>>> versions. I was not suggesting to use "v1, v2, v3" etc as the
>> values,
>>>>> but
>>>>>>>> still using Kafka versions like broker's `internal.version` config.
>>>>> But if
>>>>>>>> you were suggesting the same thing, i.e. by "0.10.1.x-1.1.x" you
>>> meant
>>>>> to
>>>>>>>> say users can just specify "0.10.1" or "0.10.2" or "0.11.0" or
>> "1.1"
>>>>> which
>>>>>>>> are all recognizable config values then I think we are actually on
>>> the
>>>>> same
>>>>>>>> page.
>>>>>>>>
>>>>>>>> 2) About the "multi-assignment" idea: yes it would increase the
>>> network
>>>>>>>> footprint, but not the message size, IF I'm not mis-understanding
>>> your
>>>>> idea
>>>>>>>> of registering multiple assignment. More details:
>>>>>>>>
>>>>>>>> In the JoinGroupRequest, in the protocols field we can encode
>>> multiple
>>>>>>>> protocols each with their different metadata. The coordinator will
>>>>> pick the
>>>>>>>> common one that everyone supports (if there are no common one, it
>>> will
>>>>> send
>>>>>>>> an error back; if there are multiple ones, it will pick the one
>> with
>>>>> most
>>>>>>>> votes, i.e. the one which was earlier in the encoded list). Since
>> our
>>>>>>>> current Streams rebalance protocol is still based on the consumer
>>>>>>>> coordinator, it means our protocol_type would be "consumer", but
>>>>> instead
>>>>>>>> the protocol type we can have multiple protocols like "streams",
>>>>>>>> "streams_v2", "streams_v3" etc. The downside is that we need to
>>>>> implement a
>>>>>>>> different assignor class for each version and register all of them
>> in
>>>>>>>> consumer's PARTITION_ASSIGNMENT_STRATEGY_CONFIG. In the future if
>> we
>>>>>>>> re-factor our implementation to have our own client coordinator
>> layer
>>>>> like
>>>>>>>> Connect did, we can simplify this part of the implementation. But
>>> even
>>>>> for
>>>>>>>> now with the above approach this is still doable.
>>>>>>>>
>>>>>>>> On the broker side, the group coordinator will only persist a group
>>>>> with
>>>>>>>> the selected protocol and its subscription metadata, e.g. if
>>>>> coordinator
>>>>>>>> decides to pick "streams_v2" it will only sends that protocol's
>>>>> metadata
>>>>>>>> from everyone to the leader to assign, AND when completing the
>>>>> rebalance it
>>>>>>>> will also only write the group metadata with that protocol and the
>>>>>>>> assignment only. In a word, although the network traffic maybe
>>>>> increased a
>>>>>>>> bit, it would not be a bummer in our trade-off. One corner
>> situation
>>> we
>>>>>>>> need to consider is how to stop registering very old assignors to
>>>>> avoid the
>>>>>>>> network traffic from increasing indefinitely, e.g. if you are
>> rolling
>>>>>>>> bounce from v2 to v3, then you'd not need to register v1 assignor
>>>>> anymore,
>>>>>>>> but that would unfortunately still require some configs.
>>>>>>>>
>>>>>>>> 3) About the  "version probing" idea, I think that's a promising
>>>>> approach
>>>>>>>> as well, but if we are going to do the multi-assignment its value
>>> seems
>>>>>>>> subsumed? But I'm thinking maybe it can be added on top of
>>>>> multi-assignment
>>>>>>>> to save us from still requiring the config to avoid registering all
>>> the
>>>>>>>> metadata for all version. More details:
>>>>>>>>
>>>>>>>> In the JoinGroupRequest, we still register all the assignor but for
>>>>> all old
>>>>>>>> assignors we do not encode any metadata, i.e. the encoded data
>> would
>>>>> be:
>>>>>>>>
>>>>>>>> "streams_vN" : "encoded metadata"
>>>>>>>> "streams_vN-1":empty
>>>>>>>> "streams_vN-2":empty
>>>>>>>> ..
>>>>>>>> "streams_0":empty
>>>>>>>>
>>>>>>>> So the coordinator can still safely choose the latest common
>> version;
>>>>> and
>>>>>>>> then when leaders receive the subscription (note it should always
>>>>> recognize
>>>>>>>> that version), let's say it is streams_vN-2, if one of the
>>>>> subscriptions
>>>>>>>> are empty bytes, it will send the empty assignment with that
>> version
>>>>> number
>>>>>>>> encoded in the metadata. So in the second auto-triggered all
>> members
>>>>> would
>>>>>>>> send the metadata with that version:
>>>>>>>>
>>>>>>>> "streams_vN" : empty
>>>>>>>> "streams_vN-1" : empty
>>>>>>>> "streams_vN-2" : "encoded metadata"
>>>>>>>> ..
>>>>>>>> "streams_0":empty
>>>>>>>>
>>>>>>>>
>>>>>>>> By doing this we would not require any configs for users.
>>>>>>>>
>>>>>>>>
>>>>>>>> 4) About the "in_place" upgrade on rocksDB, I'm not clear about the
>>>>> details
>>>>>>>> so probably we'd need to fill that out before making a call. For
>>>>> example,
>>>>>>>> you mentioned "If we detect this situation, the Streams application
>>>>> closes
>>>>>>>> corresponding active tasks as well as "hot standby" tasks, and
>>>>> re-creates
>>>>>>>> the new active tasks using the new store." How could we guarantee
>>> that
>>>>> the
>>>>>>>> gap between these two stores will keep decreasing than increasing
>> so
>>>>> we'll
>>>>>>>> eventually achieve the flip point? And also the longer we are
>> before
>>>>> the
>>>>>>>> flip point, the larger we are doubling the storage space, etc.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Guozhang
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Mar 11, 2018 at 4:06 PM, Matthias J. Sax <
>>>>> [email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> @John, Guozhang,
>>>>>>>>>
>>>>>>>>> thanks a lot for your comments. Very long reply...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> About upgrading the rebalance metadata:
>>>>>>>>>
>>>>>>>>> Another possibility to do this, would be to register multiple
>>>>> assignment
>>>>>>>>> strategies for the 1.2 applications. For this case, new instances
>>>>> would
>>>>>>>>> be configured to support both and the broker would pick the
>> version
>>>>> that
>>>>>>>>> all instances understand. The disadvantage would be, that we send
>>> much
>>>>>>>>> more data (ie, two subscriptions) in each rebalance as long as no
>>>>> second
>>>>>>>>> rebalance is done disabling the old protocol. Thus, using this
>>>>> approach
>>>>>>>>> would allow to avoid a second rebalance trading-off an increased
>>>>>>>>> rebalance network footprint (I also assume that this would
>> increase
>>>>> the
>>>>>>>>> message size that is written into __consumer_offsets topic?).
>>>>> Overall, I
>>>>>>>>> am not sure if this would be a good tradeoff, but it could avoid a
>>>>>>>>> second rebalance (I have some more thoughts about stores below
>> that
>>>>> are
>>>>>>>>> relevant for single rebalance upgrade).
>>>>>>>>>
>>>>>>>>> For future upgrades we might be able to fix this though. I was
>>>>> thinking
>>>>>>>>> about the following:
>>>>>>>>>
>>>>>>>>> In the current implementation, the leader fails if it gets a
>>>>>>>>> subscription it does not understand (ie, newer version). We could
>>>>> change
>>>>>>>>> this behavior and let the leader send an empty assignment plus
>> error
>>>>>>>>> code (including supported version) back to the instance sending
>> the
>>>>>>>>> "bad" subscription. This would allow the following logic for an
>>>>>>>>> application instance:
>>>>>>>>>
>>>>>>>>>  - on startup, always send the latest subscription format
>>>>>>>>>  - if leader understands it, we get an assignment back an start
>>>>> processing
>>>>>>>>>  - if leader does not understand it, we get an empty assignment
>> and
>>>>>>>>> supported version back
>>>>>>>>>  - the application unsubscribe()/subscribe()/poll() again and
>>> sends a
>>>>>>>>> subscription using the leader's supported version
>>>>>>>>>
>>>>>>>>> This protocol would allow to do a single rolling bounce, and
>>>>> implements
>>>>>>>>> a "version probing" step, that might result in two executed
>>>>> rebalances.
>>>>>>>>> The advantage would be, that the user does not need to set any
>>> configs
>>>>>>>>> or do multiple rolling bounces, as Streams takes care of this
>>>>>>>>> automatically.
>>>>>>>>>
>>>>>>>>> One disadvantage would be, that two rebalances happen and that for
>>> an
>>>>>>>>> error case during rebalance, we loose the information about the
>>>>>>>>> supported leader version and the "probing step" would happen a
>>> second
>>>>> time.
>>>>>>>>>
>>>>>>>>> If the leader is eventually updated, it will include it's own
>>>>> supported
>>>>>>>>> version in all assignments, to allow a "down graded" application
>> to
>>>>>>>>> upgrade its version later. Also, if a application fails, the first
>>>>>>>>> probing would always be successful and only a single rebalance
>>>>> happens.
>>>>>>>>> If we use this protocol, I think we don't need any configuration
>>>>>>>>> parameter for future upgrades.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> About "upgrade.from" vs "internal.protocol.version":
>>>>>>>>>
>>>>>>>>> Users would set "upgrade.from" to the release version the
>>> current/old
>>>>>>>>> application is using. I think this is simpler, as users know this
>>>>>>>>> version. If we use "internal.protocol.version" instead, we expose
>>>>>>>>> implementation details and users need to know the protocol version
>>>>> (ie,
>>>>>>>>> they need to map from the release version to the protocol version;
>>> ie,
>>>>>>>>> "I am run 0.11.0 that runs with metadata protocol version 2").
>>>>>>>>>
>>>>>>>>> Also the KIP states that for the second rolling bounce, the
>>>>>>>>> "upgrade.mode" config should be set back to `null` -- and thus,
>>>>>>>>> "upgrade.from" would not have any effect and is ignored (I will
>>> update
>>>>>>>>> the KIP to point out this dependency).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> About your second point: I'll update the KIP accordingly to
>> describe
>>>>>>>>> future updates as well. Both will be different.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> One more point about upgrading the store format. I was thinking
>>> about
>>>>>>>>> avoiding the second rolling bounce all together in the future: (1)
>>> the
>>>>>>>>> goal is to achieve an upgrade with zero downtime (2) this required
>>> to
>>>>>>>>> prepare the stores as "hot standbys" before we do the switch and
>>>>> delete
>>>>>>>>> the old stores. (3) the current proposal does the switch
>> "globally"
>>> --
>>>>>>>>> this is simpler and due to the required second rebalance no
>>>>> disadvantage.
>>>>>>>>> However, a global consistent switch over might actually not be
>>>>> required.
>>>>>>>>> For "in_place" upgrade, following the protocol from above, we
>> could
>>>>>>>>> decouple the store switch and each instance could switch its store
>>>>>>>>> independently from all other instances. After the rolling bounce,
>> it
>>>>>>>>> seems to be ok to switch from the old store to the new store
>> "under
>>>>> the
>>>>>>>>> hood" whenever the new store is ready (this could even be done,
>>> before
>>>>>>>>> we switch to the new metadata version). Each time we update the
>> "hot
>>>>>>>>> standby" we check if it reached the "endOffset"  (or maybe X% that
>>>>> could
>>>>>>>>> either be hardcoded or configurable). If we detect this situation,
>>> the
>>>>>>>>> Streams application closes corresponding active tasks as well as
>>> "hot
>>>>>>>>> standby" tasks, and re-creates the new active tasks using the new
>>>>> store.
>>>>>>>>> (I need to go through the details once again, but it seems to be
>>>>>>>>> feasible.).
>>>>>>>>>
>>>>>>>>> Combining this strategy with the "multiple assignment" idea, might
>>>>> even
>>>>>>>>> enable us to do an single rolling bounce upgrade from 1.1 -> 1.2.
>>>>>>>>> Applications would just use the old store, as long as the new
>> store
>>> is
>>>>>>>>> not ready, even if the new metadata version is used already.
>>>>>>>>>
>>>>>>>>> For future upgrades, a single rebalance would be sufficient, too,
>>> even
>>>>>>>>> if the stores are upgraded. We would not need any config
>> parameters
>>> as
>>>>>>>>> the "probe" step allows us to detect the supported rebalance
>>> metadata
>>>>>>>>> version (and we would also not need multiple "assigmnent
>> strategies"
>>>>> as
>>>>>>>>> out own protocol encoded everything we need).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Let me know what you think.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Matthias
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3/9/18 10:33 PM, Guozhang Wang wrote:
>>>>>>>>>> @John:
>>>>>>>>>>
>>>>>>>>>> For the protocol version upgrade, it is only for the encoded
>>> metadata
>>>>>>>>> bytes
>>>>>>>>>> protocol, which are just bytes-in bytes-out from Consumer's pov,
>>> so I
>>>>>>>>> think
>>>>>>>>>> this change should be in the Streams layer as well.
>>>>>>>>>>
>>>>>>>>>> @Matthias:
>>>>>>>>>>
>>>>>>>>>> for 2), I agree that adding a "newest supported version" besides
>>> the
>>>>>>>>>> "currently used version for encoding" is a good idea to allow
>>> either
>>>>>>>>> case;
>>>>>>>>>> the key is that in Streams we would likely end up with a mapping
>>>>> from the
>>>>>>>>>> protocol version to the other persistent data format versions
>> such
>>> as
>>>>>>>>>> rocksDB, changelog. So with such a map we can actually achieve
>> both
>>>>>>>>>> scenarios, i.e. 1) one rolling bounce if the upgraded protocol
>>>>> version's
>>>>>>>>>> corresponding data format does not change, e.g. 0.10.0 -> 0.10.1
>>>>> leaders
>>>>>>>>>> can choose to use the newer version in the first rolling bounce
>>>>> directly
>>>>>>>>>> and we can document to users that they would not need to set
>>>>>>>>>> "upgrade.mode", and 2) two rolling bounce if the upgraded
>> protocol
>>>>>>>>> version
>>>>>>>>>> does indicate the data format changes, e.g. 1.1 -> 1.2, and then
>> we
>>>>> can
>>>>>>>>>> document that "upgrade.mode" needs to be set in the first rolling
>>>>> bounce
>>>>>>>>>> and reset in the second.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Besides that, some additional comments:
>>>>>>>>>>
>>>>>>>>>> 1) I still think "upgrade.from" is less intuitive for users to
>> set
>>>>> than
>>>>>>>>>> "internal.protocol.version" where for the latter users only need
>> to
>>>>> set a
>>>>>>>>>> single version, while the Streams will map that version to the
>>>>> Streams
>>>>>>>>>> assignor's behavior as well as the data format. But maybe I did
>> not
>>>>> get
>>>>>>>>>> your idea about how the  "upgrade.from" config will be set,
>> because
>>>>> in
>>>>>>>>>> your Compatibility section how the upgrade.from config will be
>> set
>>>>> for
>>>>>>>>>> these two rolling bounces are not very clear: for example, should
>>>>> user
>>>>>>>>>> reset it to null in the second rolling bounce?
>>>>>>>>>>
>>>>>>>>>> 2) In the upgrade path description, rather than talking about
>>>>> specific
>>>>>>>>>> version 0.10.0 -> version 0.10.1 etc, can we just categorize all
>>> the
>>>>>>>>>> possible scenarios, even for future upgrade versions, what should
>>> be
>>>>> the
>>>>>>>>>> standard operations? The categorized we can summarize to would be
>>>>>>>>> (assuming
>>>>>>>>>> user upgrade from version X to version Y, where X and Y are Kafka
>>>>>>>>> versions,
>>>>>>>>>> with the corresponding supported protocol version x and y):
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> a. x == y, i.e. metadata protocol does not change, and hence no
>>>>>>>>> persistent
>>>>>>>>>> data formats have changed.
>>>>>>>>>>
>>>>>>>>>> b. x != y, but all persistent data format remains the same.
>>>>>>>>>>
>>>>>>>>>> b. x !=y, AND some persistene data format like RocksDB format,
>>>>> changelog
>>>>>>>>>> format, has been changed.
>>>>>>>>>>
>>>>>>>>>> c. special case: we may need some special handling logic when
>>>>> "current
>>>>>>>>>> version" or "newest supported version" are not available in the
>>>>> protocol,
>>>>>>>>>> i.e. for X as old as 0.10.0 and before 1.2.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> under the above scenarios, how many rolling bounces users need to
>>>>>>>>> execute?
>>>>>>>>>> how they should set the configs in each rolling bounce? and how
>>>>> Streams
>>>>>>>>>> library will execute in these cases?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Guozhang
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 9, 2018 at 4:01 PM, Matthias J. Sax <
>>>>> [email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Ted,
>>>>>>>>>>>
>>>>>>>>>>> I still consider changing the KIP to include it right away -- if
>>>>> not,
>>>>>>>>>>> I'll create a JIRA. Need to think it through in more detail
>> first.
>>>>>>>>>>>
>>>>>>>>>>> (Same for other open questions like interface names -- I collect
>>>>>>>>>>> feedback and update the KIP after we reach consensus :))
>>>>>>>>>>>
>>>>>>>>>>> -Matthias
>>>>>>>>>>>
>>>>>>>>>>> On 3/9/18 3:35 PM, Ted Yu wrote:
>>>>>>>>>>>> Thanks for the details, Matthias.
>>>>>>>>>>>>
>>>>>>>>>>>> bq. change the metadata protocol only if a future release,
>>> encoding
>>>>>>>>> both
>>>>>>>>>>> used
>>>>>>>>>>>> and supported version might be an advantage
>>>>>>>>>>>>
>>>>>>>>>>>> Looks like encoding both versions wouldn't be implemented in
>> this
>>>>> KIP.
>>>>>>>>>>>>
>>>>>>>>>>>> Please consider logging a JIRA with the encoding proposal.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Mar 9, 2018 at 2:27 PM, Matthias J. Sax <
>>>>> [email protected]
>>>>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> @Bill: I think a filter predicate should be part of user code.
>>> And
>>>>>>>>> even
>>>>>>>>>>>>> if we want to add something like this, I would prefer to do it
>>> in
>>>>> a
>>>>>>>>>>>>> separate KIP.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> @James: I would love to avoid a second rolling bounce. But
>> from
>>> my
>>>>>>>>>>>>> understanding it would not be possible.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The purpose of the second rolling bounce is indeed to switch
>>> from
>>>>>>>>>>>>> version 2 to 3. It also has a second purpose, to switch from
>> the
>>>>> old
>>>>>>>>>>>>> store to the new store (this happens after the last instance
>>>>> bounces a
>>>>>>>>>>>>> second time).
>>>>>>>>>>>>>
>>>>>>>>>>>>> The problem with one round of rolling bounces is, that it's
>>>>> unclear
>>>>>>>>> when
>>>>>>>>>>>>> to which from version 2 to version 3. The
>>>>> StreamsPartitionsAssignor is
>>>>>>>>>>>>> stateless by design, and thus, the information which version
>> it
>>>>> should
>>>>>>>>>>>>> use must be passed in from externally -- and we want to use
>> the
>>>>>>>>>>>>> StreamsConfig to pass in this information.
>>>>>>>>>>>>>
>>>>>>>>>>>>> During upgrade, all new instanced have no information about
>> the
>>>>>>>>> progress
>>>>>>>>>>>>> of the upgrade (ie, how many other instanced got upgrades
>>>>> already).
>>>>>>>>>>>>> Therefore, it's not safe for them to send a version 3
>>>>> subscription.
>>>>>>>>> The
>>>>>>>>>>>>> leader also has this limited view on the world and can only
>> send
>>>>>>>>> version
>>>>>>>>>>>>> 2 assignments back.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thus, for the 1.2 upgrade, I don't think we can simplify the
>>>>> upgrade.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We did consider to change the metadata to make later upgrades
>>> (ie,
>>>>>>>>> from
>>>>>>>>>>>>> 1.2 to 1.x) simpler though (for the case we change the
>> metadata
>>> or
>>>>>>>>>>>>> storage format again -- as long as we don't change it, a
>> single
>>>>>>>>> rolling
>>>>>>>>>>>>> bounce is sufficient), by encoding "used version" and
>> "supported
>>>>>>>>>>>>> version". This would allow the leader to switch to the new
>>> version
>>>>>>>>>>>>> earlier and without a second rebalance: leader would receive
>>> "used
>>>>>>>>>>>>> version == old" and "supported version = old/new" -- as long
>> as
>>> at
>>>>>>>>> least
>>>>>>>>>>>>> one instance sends a "supported version = old" leader sends
>> old
>>>>>>>>> version
>>>>>>>>>>>>> assignment back. However, encoding both version would allow
>> that
>>>>> the
>>>>>>>>>>>>> leader can send a new version assignment back, right after the
>>>>> first
>>>>>>>>>>>>> round or rebalance finished (all instances send "supported
>>>>> version =
>>>>>>>>>>>>> new"). However, there are still two issues with this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) if we switch to the new format right after the last
>> instance
>>>>>>>>> bounced,
>>>>>>>>>>>>> the new stores might not be ready to be used -- this could
>> lead
>>> to
>>>>>>>>>>>>> "downtime" as store must be restored before processing can
>>> resume.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2) Assume an instance fails and is restarted again. At this
>>>>> point, the
>>>>>>>>>>>>> instance will still have "upgrade mode" enabled and thus sends
>>>>> the old
>>>>>>>>>>>>> protocol data. However, it would be desirable to never fall
>> back
>>>>> to
>>>>>>>>> the
>>>>>>>>>>>>> old protocol after the switch to the new protocol.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The second issue is minor and I guess if users set-up the
>>> instance
>>>>>>>>>>>>> properly it could be avoided. However, the first issue would
>>>>> prevent
>>>>>>>>>>>>> "zero downtime" upgrades. Having said this, if we consider
>> that
>>> we
>>>>>>>>> might
>>>>>>>>>>>>> change the metadata protocol only if a future release,
>> encoding
>>>>> both
>>>>>>>>>>>>> used and supported version might be an advantage in the future
>>>>> and we
>>>>>>>>>>>>> could consider to add this information in 1.2 release to
>> prepare
>>>>> for
>>>>>>>>>>> this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Btw: monitoring the log, is also only required to give the
>>>>> instances
>>>>>>>>>>>>> enough time to prepare the stores in new format. If you would
>> do
>>>>> the
>>>>>>>>>>>>> second rolling bounce before this, it would still work --
>>>>> however, you
>>>>>>>>>>>>> might see app "downtime" as the new store must be fully
>> restored
>>>>>>>>> before
>>>>>>>>>>>>> processing can resume.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does this make sense?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 3/9/18 11:36 AM, James Cheng wrote:
>>>>>>>>>>>>>> Matthias,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For all the upgrade paths, is it possible to get rid of the
>> 2nd
>>>>>>>>> rolling
>>>>>>>>>>>>> bounce?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For the in-place upgrade, it seems like primary difference
>>>>> between
>>>>>>>>> the
>>>>>>>>>>>>> 1st rolling bounce and the 2nd rolling bounce is to decide
>>>>> whether to
>>>>>>>>>>> send
>>>>>>>>>>>>> Subscription Version 2 or Subscription Version 3.  (Actually,
>>>>> there is
>>>>>>>>>>>>> another difference mentioned in that the KIP says that the 2nd
>>>>> rolling
>>>>>>>>>>>>> bounce should happen after all new state stores are created by
>>> the
>>>>>>>>>>>>> background thread. However, within the 2nd rolling bounce, we
>>> say
>>>>> that
>>>>>>>>>>>>> there is still a background thread, so it seems like is no
>>> actual
>>>>>>>>>>>>> requirement to wait for the new state stores to be created.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The 2nd rolling bounce already knows how to deal with
>>> mixed-mode
>>>>>>>>>>> (having
>>>>>>>>>>>>> both Version 2 and Version 3 in the same consumer group). It
>>> seems
>>>>>>>>> like
>>>>>>>>>>> we
>>>>>>>>>>>>> could get rid of the 2nd bounce if we added logic
>>>>> (somehow/somewhere)
>>>>>>>>>>> such
>>>>>>>>>>>>> that:
>>>>>>>>>>>>>> * Instances send Subscription Version 2 until all instances
>> are
>>>>>>>>> running
>>>>>>>>>>>>> the new code.
>>>>>>>>>>>>>> * Once all the instances are running the new code, then one
>> at
>>> a
>>>>>>>>> time,
>>>>>>>>>>>>> the instances start sending Subscription V3. Leader still
>> hands
>>>>> out
>>>>>>>>>>>>> Assignment Version 2, until all new state stores are ready.
>>>>>>>>>>>>>> * Once all instances report that new stores are ready, Leader
>>>>> sends
>>>>>>>>> out
>>>>>>>>>>>>> Assignment Version 3.
>>>>>>>>>>>>>> * Once an instance receives an Assignment Version 3, it can
>>>>> delete
>>>>>>>>> the
>>>>>>>>>>>>> old state store.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Doing it that way seems like it would reduce a lot of
>>>>>>>>>>>>> operator/deployment overhead. No need to do 2 rolling
>> restarts.
>>> No
>>>>>>>>> need
>>>>>>>>>>> to
>>>>>>>>>>>>> monitor logs for state store rebuild. You just deploy it, and
>>> the
>>>>>>>>>>> instances
>>>>>>>>>>>>> update themselves.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The thing that made me think of this is that the "2 rolling
>>>>> bounces"
>>>>>>>>> is
>>>>>>>>>>>>> similar to what Kafka brokers have to do changes in
>>>>>>>>>>>>> inter.broker.protocol.version and log.message.format.version.
>>> And
>>>>> in
>>>>>>>>> the
>>>>>>>>>>>>> broker case, it seems like it would be possible (with some
>> work
>>> of
>>>>>>>>>>> course)
>>>>>>>>>>>>> to modify kafka to allow us to do similar auto-detection of
>>> broker
>>>>>>>>>>>>> capabilities and automatically do a switchover from old/new
>>>>> versions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -James
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mar 9, 2018, at 10:38 AM, Bill Bejeck <[email protected]
>>>
>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Matthias,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the KIP, it's a +1 from me.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I do have one question regarding the retrieval methods on
>> the
>>>>> new
>>>>>>>>>>>>>>> interfaces.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Would want to consider adding one method with a Predicate
>> that
>>>>> would
>>>>>>>>>>>>> allow
>>>>>>>>>>>>>>> for filtering records by the timestamp stored with the
>> record?
>>>>> Or
>>>>>>>>> is
>>>>>>>>>>>>> this
>>>>>>>>>>>>>>> better left for users to implement themselves once the data
>>> has
>>>>> been
>>>>>>>>>>>>>>> retrieved?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Mar 8, 2018 at 7:14 PM, Ted Yu <[email protected]
>>>
>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Matthias:
>>>>>>>>>>>>>>>> For my point #1, I don't have preference as to which
>>> separator
>>>>> is
>>>>>>>>>>>>> chosen.
>>>>>>>>>>>>>>>> Given the background you mentioned, current choice is good.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For #2, I think my proposal is better since it is closer to
>>>>> English
>>>>>>>>>>>>>>>> grammar.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Would be good to listen to what other people think.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Mar 8, 2018 at 4:02 PM, Matthias J. Sax <
>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for the comments!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @Guozhang:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So far, there is one PR for the rebalance metadata upgrade
>>> fix
>>>>>>>>>>>>>>>>> (addressing the mentioned
>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-6054) It
>> give a
>>>>> first
>>>>>>>>>>>>>>>>> impression how the metadata upgrade works including a
>> system
>>>>> test:
>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/4636
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I can share other PRs as soon as they are ready. I agree
>>> that
>>>>> the
>>>>>>>>>>> KIP
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>> complex am I ok with putting out more code to give better
>>>>>>>>> discussion
>>>>>>>>>>>>>>>>> context.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @Ted:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I picked `_` instead of `-` to align with the
>>>>>>>>> `processing.guarantee`
>>>>>>>>>>>>>>>>> parameter that accepts `at_least_one` and `exactly_once`
>> as
>>>>>>>>> values.
>>>>>>>>>>>>>>>>> Personally, I don't care about underscore vs dash but I
>>> prefer
>>>>>>>>>>>>>>>>> consistency. If you feel strong about it, we can also
>> change
>>>>> it to
>>>>>>>>>>>>> `-`.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> About the interface name: I am fine either way -- I
>> stripped
>>>>> the
>>>>>>>>>>>>> `With`
>>>>>>>>>>>>>>>>> to keep the name a little shorter. Would be good to get
>>>>> feedback
>>>>>>>>>>> from
>>>>>>>>>>>>>>>>> others and pick the name the majority prefers.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @John:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We can certainly change it. I agree that it would not
>> make a
>>>>>>>>>>>>> difference.
>>>>>>>>>>>>>>>>> I'll dig into the code to see if any of the two version
>>> might
>>>>>>>>>>>>> introduce
>>>>>>>>>>>>>>>>> undesired complexity and update the KIP if I don't hit an
>>>>> issue
>>>>>>>>> with
>>>>>>>>>>>>>>>>> putting the `-v2` to the store directory instead of
>>>>> `rocksdb-v2`
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 3/8/18 2:44 PM, John Roesler wrote:
>>>>>>>>>>>>>>>>>> Hey Matthias,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The KIP looks good to me. I had several questions queued
>>> up,
>>>>> but
>>>>>>>>>>> they
>>>>>>>>>>>>>>>>> were
>>>>>>>>>>>>>>>>>> all in the "rejected alternatives" section... oh, well.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> One very minor thought re changing the state directory
>> from
>>>>>>>>>>>>>>>>> "/<state.dir>/<
>>>>>>>>>>>>>>>>>> application.id>/<task.id>/rocksdb/storeName/" to
>>>>> "/<state.dir>/<
>>>>>>>>>>>>>>>>>> application.id>/<task.id>/rocksdb-v2/storeName/": if you
>>>>> put the
>>>>>>>>>>>>> "v2"
>>>>>>>>>>>>>>>>>> marker on the storeName part of the path (i.e.,
>>>>> "/<state.dir>/<
>>>>>>>>>>>>>>>>>> application.id>/<task.id>/rocksdb/storeName-v2/"), then
>>> you
>>>>> get
>>>>>>>>>>> the
>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>> benefits without altering the high-level directory
>>> structure.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> It may not matter, but I could imagine people running
>>>>> scripts to
>>>>>>>>>>>>>>>> monitor
>>>>>>>>>>>>>>>>>> rocksdb disk usage for each task, or other such use
>> cases.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Mar 8, 2018 at 2:02 PM, Ted Yu <
>>> [email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Matthias:
>>>>>>>>>>>>>>>>>>> Nicely written KIP.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> "in_place" : can this be "in-place" ? Underscore may
>>>>> sometimes
>>>>>>>>> be
>>>>>>>>>>>>> miss
>>>>>>>>>>>>>>>>>>> typed (as '-'). I think using '-' is more friendly to
>>> user.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> public interface ReadOnlyKeyValueTimestampStore<K, V> {
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Is ReadOnlyKeyValueStoreWithTimestamp better name for
>> the
>>>>>>>>> class ?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Mar 8, 2018 at 1:29 PM, Guozhang Wang <
>>>>>>>>> [email protected]
>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hello Matthias, thanks for the KIP.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I've read through the upgrade patch section and it
>> looks
>>>>> good
>>>>>>>>> to
>>>>>>>>>>>>> me,
>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>>> already have a WIP PR for it could you also share it
>> here
>>>>> so
>>>>>>>>> that
>>>>>>>>>>>>>>>>> people
>>>>>>>>>>>>>>>>>>>> can take a look?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm +1 on the KIP itself. But large KIPs like this
>> there
>>>>> are
>>>>>>>>>>> always
>>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>>> devil hidden in the details, so I think it is better to
>>>>> have
>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> implementation in parallel along with the design
>>>>> discussion :)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Mar 7, 2018 at 2:12 PM, Matthias J. Sax <
>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I want to propose KIP-258 for the Streams API to allow
>>>>> storing
>>>>>>>>>>>>>>>>>>>>> timestamps in RocksDB. This feature is the basis to
>>>>> resolve
>>>>>>>>>>>>> multiple
>>>>>>>>>>>>>>>>>>>>> tickets (issues and feature requests).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Looking forward to your comments about this!
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>>>>>>>>>>>>>>>>>>>>> 258%3A+Allow+to+Store+Record+Timestamps+in+RocksDB
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> -- Guozhang
>>
>

signature.asc
Description: OpenPGP digital signature

Re: [DISCUSS] KIP-258: Allow to Store Record Timestamps in RocksDB

Reply via email to