Re: [DISCUSS] KIP-116 - Add State Store Checkpoint Interval Configuration

Damian Guy Wed, 15 Feb 2017 05:41:16 -0800

OK, let's close this KIP off then, as it isn't needed at the moment. We can
revive it later if needed.


On Tue, 14 Feb 2017 at 04:16 Eno Thereska <eno.there...@gmail.com> wrote:

> Even if users commit on every record, the expensive part will not be the
> checkpointing proposed in this KIP, but the rest of the commit.
>
> Eno
>
>
> > On 13 Feb 2017, at 23:46, Guozhang Wang <wangg...@gmail.com> wrote:
> >
> > I think I'm OK with always enabling checkpointing, but I'm not sure we
> > want to checkpoint on every commit, since in the extreme case users can
> > commit after processing each record. So I think it is still valuable to
> > have a checkpoint interval config in this KIP, which can be ignored if
> > EOS is turned on. That being said, if most people favor checkpointing on
> > each commit, we can try that as well, since it won't change any public
> > APIs and we can still add this config in the future if we observe users
> > reporting that it has a large perf impact.
> >
> >
> >
> > Guozhang
> >
> > On Fri, Feb 10, 2017 at 12:20 PM, Damian Guy <damian....@gmail.com>
> wrote:
> >
> >> I'm fine with that. Guozhang?
> >> On Fri, 10 Feb 2017 at 19:45, Matthias J. Sax <matth...@confluent.io>
> >> wrote:
> >>
> >>> I am actually supporting Eno's view: checkpoint on every commit.
> >>>
> >>> @Dhwani: I understand your view and did raise the same question about
> >>> performance trade-off with checkpointing enabled/disabled etc. However,
> >>> it seems that writing the checkpoint file is super cheap -- thus, there
> >>> is nothing to gain performance wise by disabling it.
> >>>
> >>> For Streams EoS we do not need the checkpoint file -- but we should
> have
> >>> a switch for EoS anyway and can disable the checkpoint file for this
> >>> case. And even if there is no switch and we enable EoS all the time, we
> >>> can get rid of the checkpoint file overall (making the parameter
> >> obsolete).
> >>>
> >>> IMHO, if the config parameter is not really useful, we should not have
> >> it.
> >>>
> >>>
> >>> -Matthias
> >>>
> >>>
> >>> On 2/10/17 9:27 AM, Damian Guy wrote:
> >>>> Guozhang, thanks for the clarification. Understood.
> >>>>
> >>>> Eno, you are correct if we just used commit interval then we wouldn't
> >>> need
> >>>> a KIP. But, then we'd have no way of turning it off.
> >>>>
> >>>> On Fri, 10 Feb 2017 at 17:14 Eno Thereska <eno.there...@gmail.com>
> >>> wrote:
> >>>>
> >>>>> A quick check: the checkpoint file is not new, we're just exposing a
> >>> knob
> >>>>> on when to set it, right? Would turning it off still do what it does
> >>> today
> >>>>> (i.e., write the checkpoint at the end when the user quits?) So it's
> >>> not a
> >>>>> new feature as such, I was only recommending we dial up the frequency
> >> by
> >>>>> default. With that option arguably we don't even need a KIP.
> >>>>>
> >>>>> Eno
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On 10 Feb 2017, at 17:02, Guozhang Wang <wangg...@gmail.com> wrote:
> >>>>>>
> >>>>>> Damian,
> >>>>>>
> >>>>>> I was wondering whether this was a new failure scenario, but as Eno
> >>>>>> pointed out, it is not.
> >>>>>>
> >>>>>> Another thing I was considering is if it has any impact for
> >>> incorporating
> >>>>>> KIP-98 to avoid duplicates: if there is a failure in the middle of a
> >>>>>> transaction, then upon recovery we cannot rely on the local state
> >> store
> >>>>>> file even if the checkpoint file exists, since the local state store
> >>> file
> >>>>>> may not be at the transaction boundaries. But since Streams will
> >>>>>> likely have EOS as an opt-in, I think it is still worthwhile to add
> >>>>>> this feature, just keeping in mind that when EOS is turned on it may
> >>>>>> cease to be effective.
> >>>>>>
> >>>>>> And yes, I'd suggest we leave the config value to be possibly
> >>>>> non-positive
> >>>>>> to indicate not turning on this feature for the reason above: if it
> >>> will
> >>>>>> not be effective then we want to leave it as an option to be turned
> >>> off.
> >>>>>>
> >>>>>> Guozhang
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Feb 10, 2017 at 8:06 AM, Eno Thereska <
> >> eno.there...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> The overhead of writing to the checkpoint file should be much, much
> >>>>>>> smaller than the overall overhead of doing a commit, so I think
> >> tuning
> >>>>> the
> >>>>>>> commit time is sufficient to guide performance tradeoffs.
> >>>>>>>
> >>>>>>> Eno
> >>>>>>>
> >>>>>>>> On 10 Feb 2017, at 13:08, Dhwani Katagade <
> >>>>> dhwani_katag...@persistent.co
> >>>>>>> .in> wrote:
> >>>>>>>>
> >>>>>>>> Maybe for fine-tuning the performance.
> >>>>>>>> Say we don't need the checkpointing and would like to gain the
> >>>>>>>> little bit of performance improvement by turning it off.
> >>>>>>>> The trade-off is between giving people control knobs vs.
> >>>>>>>> complicating the complete set of knobs.
> >>>>>>>>
> >>>>>>>> -dk
> >>>>>>>>
> >>>>>>>> On Friday 10 February 2017 04:05 PM, Eno Thereska wrote:
> >>>>>>>>> I can't see why users would care to turn it off.
> >>>>>>>>>
> >>>>>>>>> Eno
> >>>>>>>>>> On 10 Feb 2017, at 10:29, Damian Guy <damian....@gmail.com>
> >> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi Eno,
> >>>>>>>>>>
> >>>>>>>>>> Sounds good to me. The only reason i can think of is if we want
> >> to
> >>> be
> >>>>>>> able
> >>>>>>>>>> to turn it off.
> >>>>>>>>>> Guozhang - thoughts?
> >>>>>>>>>>
> >>>>>>>>>> On Fri, 10 Feb 2017 at 10:28 Eno Thereska <
> >> eno.there...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Question: if checkpointing is so cheap why not do it every
> >> commit
> >>>>>>>>>>> interval? That way we can get rid of this extra config variable
> >>> and
> >>>>>>> just
> >>>>>>>>>>> use the existing commit interval.
> >>>>>>>>>>>
> >>>>>>>>>>> Fewer tuning knobs.
> >>>>>>>>>>>
> >>>>>>>>>>> Eno
> >>>>>>>>>>>
> >>>>>>>>>>>> On 10 Feb 2017, at 09:27, Damian Guy <damian....@gmail.com>
> >>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Guozhang,
> >>>>>>>>>>>>
> >>>>>>>>>>>> You've confused me. The failure scenarios you have described
> >> are
> >>>>> the
> >>>>>>> same
> >>>>>>>>>>>> as they are today. With the checkpoint files in place less
> data
> >>>>> will
> >>>>>>> be
> >>>>>>>>>>>> replayed, so there will be fewer duplicates.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Are you saying you'd like the option to turn checkpointing
> off?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Damian
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, 9 Feb 2017 at 21:55 Guozhang Wang <wangg...@gmail.com
> >
> >>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Eno,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> You are right, it is not a new scenario.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thinking a bit more on how we could incorporate KIP-98 in Streams,
> >>>>>>>>>>>>> I feel that if EOS is turned on inside Streams, then we probably
> >>>>>>>>>>>>> cannot always resume from the checkpointed offsets, as they are not
> >>>>>>>>>>>>> guaranteed to be "consistent"; but since EOS may not be turned on
> >>>>>>>>>>>>> by default, it is still worthwhile to add this feature, I guess.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> About the default config values: I think the default value of 5
> >>>>>>>>>>>>> min is OK to me, since restoration is usually faster than normal
> >>>>>>>>>>>>> processing (unless your traffic was really high). About allowing
> >>>>>>>>>>>>> it to be "turned off" with a non-positive value: I feel there is
> >>>>>>>>>>>>> still value in keeping this door open, as in the future, if EOS is
> >>>>>>>>>>>>> turned on, people may just want to turn off checkpointing anyway,
> >>>>>>>>>>>>> or there may be other scenarios that we have not realized yet. On
> >>>>>>>>>>>>> the other hand, I would argue that users are unlikely to set it to
> >>>>>>>>>>>>> a non-positive value by mistake.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Guozhang
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Feb 9, 2017 at 1:03 PM, Eno Thereska <
> >>>>>>> eno.there...@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Guozhang,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> It seems to me we have the same semantics today. Are you
> >> saying
> >>>>>>> there
> >>>>>>>>>>> is
> >>>>>>>>>>>>> a
> >>>>>>>>>>>>>> new failure scenario?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> Eno
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 9 Feb 2017, at 19:42, Guozhang Wang <wangg...@gmail.com
> >
> >>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> More specifically, here is my reasoning about the failure cases;
> >>>>>>>>>>>>>>> I would like to get your feedback:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *StreamTask*
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> For a stream task, the committing order is 1) flush state (may
> >>>>>>>>>>>>>>> send more records to the changelog in the producer), 2) flush
> >>>>>>>>>>>>>>> producer, 3) commit upstream offsets. My understanding is that the
> >>>>>>>>>>>>>>> writing of the checkpoint file will be between 2) and 3), so that
> >>>>>>>>>>>>>>> the new order will be 1) flush state, 2) flush producer, 3) write
> >>>>>>>>>>>>>>> checkpoint file (when necessary), 4) commit upstream offsets.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> And we have a bunch of "changelog offsets" regarding the state:
> >>>>>>>>>>>>>>> a) the offset corresponding to the image of the persistent file,
> >>>>>>>>>>>>>>> name it offset A; b) the log end offset, name it offset B; c) the
> >>>>>>>>>>>>>>> offset recorded in the checkpoint file, name it offset C; d) the
> >>>>>>>>>>>>>>> offset corresponding to the current committed upstream offset,
> >>>>>>>>>>>>>>> name it offset D.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Now let's talk about the failure cases:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If there is a crash between 1) and 2), then A > B = C = D. In this
> >>>>>>>>>>>>>>> case, if we restore, we will replay no logs at all since B = C,
> >>>>>>>>>>>>>>> while the persistent state file is actually "ahead of time", and
> >>>>>>>>>>>>>>> we will start reprocessing from the input offset corresponding to
> >>>>>>>>>>>>>>> D = B < A and hence have some duplicates, *which will be
> >>>>>>>>>>>>>>> incorrect* if the update logic involves reading the state store
> >>>>>>>>>>>>>>> values as well (i.e. not a blind write), e.g. aggregations.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If there is a crash between 2) and 3), then A = B > C = D. When we
> >>>>>>>>>>>>>>> restore, we will replay from C -> B = A, and then start
> >>>>>>>>>>>>>>> reprocessing from the input offset corresponding to D < A, and the
> >>>>>>>>>>>>>>> same issue applies as above.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If there is a crash between 3) and 4), then A = B = C > D. When we
> >>>>>>>>>>>>>>> restore, we will not replay, and then start reprocessing from the
> >>>>>>>>>>>>>>> input offset corresponding to D < A, and the same issue applies as
> >>>>>>>>>>>>>>> above.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *StandbyTask*
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We only do one operation today, which is 1) flush state, I
> >>> think
> >>>>>>> we
> >>>>>>>>>>>>> will
> >>>>>>>>>>>>>>> add the writing of the checkpoint file after it as step 2).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Failure cases again: offset A -> corresponds to the image of the
> >>>>>>>>>>>>>>> file, offset B -> changelog end offset, offset C -> as written in
> >>>>>>>>>>>>>>> the checkpoint file.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If there is a crash between 1) and 2), then B >= A > C (B >= A
> >>>>>>>>>>>>>>> because we are reading from the changelog topic, so A will never
> >>>>>>>>>>>>>>> be greater than B),
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 1) and if this task resumes as a standby task, we will resume
> >>>>>>>>>>>>>>> restoration from offset C, and a few duplicates from C -> A will
> >>>>>>>>>>>>>>> be applied again to local state files, then continue from A -> B;
> >>>>>>>>>>>>>>> *this is OK* since they do not incur any computations, hence have
> >>>>>>>>>>>>>>> no side effects and are all idempotent.
> >>>>>>>>>>>>>>> 2) and if this task resumes as a stream task, we will replay
> >>>>>>>>>>>>>>> changelogs from C -> A, with duplicated updates, and then from
> >>>>>>>>>>>>>>> A -> B. This is also OK for the same reason as above.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> So it seems to me that this is not safe for a StreamTask,
> or
> >>>>>>> maybe the
> >>>>>>>>>>>>>>> writing of the checkpoint file in your mind is different?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Guozhang
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Thu, Feb 9, 2017 at 11:02 AM, Guozhang Wang <
> >>>>>>> wangg...@gmail.com>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>> A quick question re: `We will add the above config
> >> parameter
> >>> to
> >>>>>>>>>>>>>>>> *StreamsConfig*. During *StreamTask#commit()*,
> >>>>>>>>>>> *StandbyTask#commit()*,
> >>>>>>>>>>>>>>>> and *GlobalUpdateStateTask#flushState()* we will check if
> >> the
> >>>>>>>>>>>>>> checkpoint
> >>>>>>>>>>>>>>>> interval has elapsed and write the checkpoint file.`
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Will the writing of the checkpoint file happen before the
> >>>>>>> flushing of
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> state manager?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Guozhang
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Thu, Feb 9, 2017 at 10:48 AM, Matthias J. Sax <
> >>>>>>>>>>>>> matth...@confluent.io
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> But 5 min means that we (in the worst case) need to replay data
> >>>>>>>>>>>>>>>>> from the last 5 minutes to get the store ready.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> So why not go with the min possible value of 30 seconds
> to
> >>>>>>> speed up
> >>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>> process if the impact is negligible anyway?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> What do you gain by being conservative?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On 2/9/17 2:54 AM, Damian Guy wrote:
> >>>>>>>>>>>>>>>>>> Why shouldn't it be 5 minutes? ;-)
> >>>>>>>>>>>>>>>>>> It is a finger in the air number. Based on the testing i
> >>> did
> >>>>> it
> >>>>>>>>>>>>> shows
> >>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>> there isn't much, if any, overhead when checkpointing a
> >>>>> single
> >>>>>>>>>>> store
> >>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> commit interval. The default commit interval is 30
> >> seconds,
> >>>>> so
> >>>>>>> it
> >>>>>>>>>>>>>> could
> >>>>>>>>>>>>>>>>>> possibly be set to that. However, i'd prefer to be a
> >> little
> >>>>>>>>>>>>>>>>> conservative so
> >>>>>>>>>>>>>>>>>> 5 minutes seemed reasonable.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Thu, 9 Feb 2017 at 10:25 Michael Noll <
> >>>>> mich...@confluent.io
> >>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>> Damian,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> could you elaborate briefly why the default value
> should
> >>> be
> >>>>> 5
> >>>>>>>>>>>>>> minutes?
> >>>>>>>>>>>>>>>>>>> What are the considerations, assumptions, etc. that go
> >>> into
> >>>>>>>>>>> picking
> >>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>> value?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Right now, in the KIP and in this discussion, "5 mins"
> >>> looks
> >>>>>>> like
> >>>>>>>>>>> a
> >>>>>>>>>>>>>>>>> magic
> >>>>>>>>>>>>>>>>>>> number to me. :-)
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> -Michael
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Thu, Feb 9, 2017 at 11:03 AM, Damian Guy <
> >>>>>>> damian....@gmail.com
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>> I've run the SimpleBenchmark with checkpointing on and off to see
> >>>>>>>>>>>>>>>>>>>> what the impact is. It appears that there is very little impact,
> >>>>>>>>>>>>>>>>>>>> if any. The numbers with checkpointing on actually look better,
> >>>>>>>>>>>>>>>>>>>> but that is likely largely due to external influences.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> In any case, I'm going to suggest we go with a default checkpoint
> >>>>>>>>>>>>>>>>>>>> interval of 5 minutes. I've updated the KIP with this.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> commit every 10 seconds (no checkpoint)
> >>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
> >>>>>>>>>>>>>>>>>>>> 10000000/34798/287372.83751939767/29.570664980746017
> >>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
> >>>>>>>>>>>>>>>>>>>> 10000000/35942/278226.0308274442/28.62945857214401
> >>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
> >>>>>>>>>>>>>>>>>>>> 10000000/34677/288375.58035585546/29.673847218617528
> >>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
> >>>>>>>>>>>>>>>>>>>> 10000000/34677/288375.58035585546/29.673847218617528
> >>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
> >>>>>>>>>>>>>>>>>>>> 10000000/31192/320595.02436522185/32.98922800718133
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> checkpoint every 10 seconds (same as commit interval)
> >>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
> >>>>>>>>>>>>>>>>>>>> 10000000/36997/270292.185852907/27.81306592426413
> >>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
> >>>>>>>>>>>>>>>>>>>> 10000000/32087/311652.69423754164/32.069062237043035
> >>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
> >>>>>>>>>>>>>>>>>>>> 10000000/32895/303997.5680194558/31.281349749202004
> >>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
> >>>>>>>>>>>>>>>>>>>> 10000000/33476/298721.4720994145/30.738439479029754
> >>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
> >>>>>>>>>>>>>>>>>>>> 10000000/33196/301241.1133871551/30.99771056753826
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Wed, 8 Feb 2017 at 09:02 Damian Guy <
> >>>>> damian....@gmail.com
> >>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>> Matthias,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Fair point. I'll update the KIP.
> >>>>>>>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Wed, 8 Feb 2017 at 05:49 Matthias J. Sax <
> >>>>>>>>>>>>> matth...@confluent.io
> >>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>> Damian,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I am not strict about it either. However, if there is no
> >>>>>>>>>>>>>>>>>>>>> advantage in disabling it, we might not want to allow it. This
> >>>>>>>>>>>>>>>>>>>>> would have the advantage of guarding users against accidentally
> >>>>>>>>>>>>>>>>>>>>> switching it off.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On 2/3/17 2:03 AM, Damian Guy wrote:
> >>>>>>>>>>>>>>>>>>>>>> Hi Matthias,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> It possibly doesn't make sense to disable it, but then I'm sure
> >>>>>>>>>>>>>>>>>>>>>> someone will come up with a reason they don't want it!
> >>>>>>>>>>>>>>>>>>>>>> I'm happy to change it such that the checkpoint interval must
> >>>>>>>>>>>>>>>>>>>>>> be > 0.
> >>>>>>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>>>>> Damian
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Fri, 3 Feb 2017 at 01:29 Matthias J. Sax <
> >>>>>>>>>>>>>> matth...@confluent.io>
> >>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>> Thanks Damian.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> One more question: "Checkpointing is disabled if
> the
> >>>>>>>>>>> checkpoint
> >>>>>>>>>>>>>>>>>>>> interval
> >>>>>>>>>>>>>>>>>>>>>>> is set to a value <=0."
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Does it make sense to disable check pointing?
> What's
> >>> the
> >>>>>>>>>>>>> tradeoff
> >>>>>>>>>>>>>>>>>>>> here?
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On 2/2/17 1:51 AM, Damian Guy wrote:
> >>>>>>>>>>>>>>>>>>>>>>>> Hi Matthias,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Thanks for the comments.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> 1. TBD - i need to do some performance tests and
> >> try
> >>>>> and
> >>>>>>> work
> >>>>>>>>>>>>>> out
> >>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>> sensible default.
> >>>>>>>>>>>>>>>>>>>>>>>> 2. Yes, you are correct. It could be a multiple of
> >>> the
> >>>>>>>>>>>>>>>>>>>>>>> commit.interval.ms.
> >>>>>>>>>>>>>>>>>>>>>>>> But, that would also mean if you change the commit
> >>>>>>> interval -
> >>>>>>>>>>>>>> say
> >>>>>>>>>>>>>>>>>>> you
> >>>>>>>>>>>>>>>>>>>>>>> lower
> >>>>>>>>>>>>>>>>>>>>>>>> it, then you might also need to change the
> >> checkpoint
> >>>>>>> setting
> >>>>>>>>>>>>>>>>> (i.e,
> >>>>>>>>>>>>>>>>>>>> you
> >>>>>>>>>>>>>>>>>>>>>>>> still only want to checkpoint every n minutes).
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> On Wed, 1 Feb 2017 at 23:46 Matthias J. Sax <
> >>>>>>>>>>>>>>>>> matth...@confluent.io
> >>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP Damian.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> I am wondering about two things:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> 1. what should be the default value for the new
> >>>>>>> parameter?
> >>>>>>>>>>>>>>>>>>>>>>>>> 2. why is the new parameter provided in ms?
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> About (2): because
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> "the minimum checkpoint interval will be the
> value
> >>> of
> >>>>>>>>>>>>>>>>>>>>>>>>> commit.interval.ms. In effect the actual
> >> checkpoint
> >>>>>>>>>>> interval
> >>>>>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>>> multiple of the commit interval"
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> it might be easier to just use a parameter that is
> >>>>>>>>>>>>>>>>>>>>>>>>> "number-of-commit intervals".
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On 2/1/17 7:29 AM, Damian Guy wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the comments Eno.
> >>>>>>>>>>>>>>>>>>>>>>>>>> As for exactly once, i don't believe this
> matters
> >>> as
> >>>>>>> we are
> >>>>>>>>>>>>>> just
> >>>>>>>>>>>>>>>>>>>>>>>>> restoring
> >>>>>>>>>>>>>>>>>>>>>>>>>> the change-log, i.e, the result of the
> >> aggregations
> >>>>>>> that
> >>>>>>>>>>>>>>>>>>> previously
> >>>>>>>>>>>>>>>>>>>>> ran
> >>>>>>>>>>>>>>>>>>>>>>>>>> etc. So once initialized the state store will be
> >> in
> >>>>> the
> >>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>> state
> >>>>>>>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>>>>>>>>>> was before.
> >>>>>>>>>>>>>>>>>>>>>>>>>> Having the checkpoint in a kafka topic is not
> >> ideal
> >>>>> as
> >>>>>>> the
> >>>>>>>>>>>>>> state
> >>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>> per
> >>>>>>>>>>>>>>>>>>>>>>>>>> kafka streams instance. So each instance would
> >> need
> >>>>> to
> >>>>>>>>>>> start
> >>>>>>>>>>>>>>>>>>> with a
> >>>>>>>>>>>>>>>>>>>>>>>>> unique
> >>>>>>>>>>>>>>>>>>>>>>>>>> id that is persistent.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>>>>>>>>> Damian
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, 1 Feb 2017 at 13:20 Eno Thereska <
> >>>>>>>>>>>>>>>>> eno.there...@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>> As a follow up to my previous comment, have you
> >>>>>>> thought
> >>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>>>>>>> writing
> >>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>> checkpoint to a topic instead of a local file?
> >>> That
> >>>>>>> would
> >>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>> advantage that all metadata continues to be
> >>> managed
> >>>>> by
> >>>>>>>>>>>>> Kafka,
> >>>>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>>>>>> well
> >>>>>>>>>>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>>>>>>>>>>>> fit with EoS. The potential disadvantage would
> >> be
> >>> a
> >>>>>>> slower
> >>>>>>>>>>>>>>>>>>>> latency,
> >>>>>>>>>>>>>>>>>>>>>>>>> however
> >>>>>>>>>>>>>>>>>>>>>>>>>>> if it is periodic as you mention, I'm not sure
> >>> that
> >>>>>>> would
> >>>>>>>>>>>>> be
> >>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>> show
> >>>>>>>>>>>>>>>>>>>>>>>>> stopper.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Eno
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On 1 Feb 2017, at 12:58, Eno Thereska <
> >>>>>>>>>>>>>> eno.there...@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks Damian, this is a good idea and will
> >>> reduce
> >>>>>>> the
> >>>>>>>>>>>>>> restore
> >>>>>>>>>>>>>>>>>>>>> time.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward, with exactly once and support
> >> for
> >>>>>>>>>>>>>> transactions
> >>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>> Kafka, I
> >>>>>>>>>>>>>>>>>>>>>>>>>>> believe we'll have to add some support for
> >> rolling
> >>>>>>> back
> >>>>>>>>>>>>>>>>>>>> checkpoints,
> >>>>>>>>>>>>>>>>>>>>>>>>> e.g.,
> >>>>>>>>>>>>>>>>>>>>>>>>>>> when a transaction is aborted. We need to be
> >> aware
> >>>>> of
> >>>>>>> that
> >>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>> ideally
> >>>>>>>>>>>>>>>>>>>>>>>>>>> anticipate a bit those needs in the KIP.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Eno
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 1 Feb 2017, at 10:18, Damian Guy <
> >>>>>>>>>>>>> damian....@gmail.com>
> >>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to start the discussion on
> >> KIP-116:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-116+-+Add+State+Store+Checkpoint+Interval+Configuration
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Damian
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>> -- Guozhang
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>> -- Guozhang
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> -- Guozhang
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> -- Guozhang
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
> >
> > --
> > -- Guozhang
>
>
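
For reference, the commit sequence discussed in the quoted thread (flush
state, flush producer, write the checkpoint file when due, commit upstream
offsets) can be sketched roughly as below. This is only an illustrative
sketch under stated assumptions, not Kafka Streams code: the class
CheckpointIntervalSketch and all of its members are hypothetical names, the
position of the checkpoint write follows Guozhang's stated understanding, and
treating a non-positive interval as "checkpointing disabled" follows the
proposal as quoted above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names are Kafka Streams APIs.
final class CheckpointIntervalSketch {
    // Assumption from the thread: a value <= 0 disables periodic checkpointing.
    private final long checkpointIntervalMs;
    private long lastCheckpointMs;
    // Records the order of the steps taken, for illustration only.
    final List<String> steps = new ArrayList<>();

    CheckpointIntervalSketch(long checkpointIntervalMs, long startMs) {
        this.checkpointIntervalMs = checkpointIntervalMs;
        this.lastCheckpointMs = startMs;
    }

    // True when checkpointing is enabled and the interval has elapsed.
    boolean checkpointDue(long nowMs) {
        return checkpointIntervalMs > 0
                && nowMs - lastCheckpointMs >= checkpointIntervalMs;
    }

    // Order as described in the thread: 1) flush state, 2) flush producer,
    // 3) write checkpoint file (when necessary), 4) commit upstream offsets.
    void commit(long nowMs) {
        steps.add("flush state");
        steps.add("flush producer");
        if (checkpointDue(nowMs)) {
            steps.add("write checkpoint");
            lastCheckpointMs = nowMs;
        }
        steps.add("commit offsets");
    }

    public static void main(String[] args) {
        // 5-minute checkpoint interval, commits arriving at 30 s and 5 min.
        CheckpointIntervalSketch task = new CheckpointIntervalSketch(300_000L, 0L);
        task.commit(30_000L);   // interval not yet elapsed: no checkpoint step
        task.commit(300_000L);  // interval elapsed: checkpoint step included
        System.out.println(task.steps);
    }
}
```

Because the check runs only inside commit(), the effective checkpoint
interval in this sketch is a multiple of the commit interval, which matches
the KIP text quoted by Matthias.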
