The prototype implementation assumes that checkpoints are always stored
in HDFS, but users could implement their own storage agent, in which case
this implementation may not work. A more useful approach would be to have
a metadata file for each savepoint which stores the operator id and
checkpoint id, and to prevent the master from purging those checkpoints
on commit. During restart the storage agent can fetch the required
checkpoints from its store, and which checkpoints to load will be
available in the savepoint metadata file.
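
A minimal sketch of what such a savepoint metadata file could hold, assuming a
hypothetical SavepointMetadata class (the class, field, and method names are
illustrative only, not what the prototype actually uses):

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-savepoint metadata: for every physical operator it records
// the checkpoint window id that belongs to this savepoint. The master would
// skip these checkpoint ids when purging on commit, and on relaunch the
// storage agent would be asked to load exactly these checkpoints.
public class SavepointMetadata implements Serializable {
  private static final long serialVersionUID = 1L;

  // User-supplied savepoint name.
  private final String name;

  // Physical operator id -> checkpointed window id pinned by this savepoint.
  private final Map<Integer, Long> operatorCheckpoints = new HashMap<>();

  public SavepointMetadata(String name) {
    this.name = name;
  }

  public void addOperatorCheckpoint(int operatorId, long windowId) {
    operatorCheckpoints.put(operatorId, windowId);
  }

  public Long getCheckpointFor(int operatorId) {
    return operatorCheckpoints.get(operatorId);
  }

  public Map<Integer, Long> getOperatorCheckpoints() {
    return operatorCheckpoints;
  }

  public String getName() {
    return name;
  }
}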

- Tushar.



On Mon, Aug 8, 2016 at 8:49 PM, Sandesh Hegde <sand...@datatorrent.com> wrote:
> The idea here was to create a recovery/committed window on demand. But
> there is always one (except before the first) recovery window for the DAG.
> Instead of using/modifying the Checkpoint tuple, I am planning to reuse
> the existing recovery window state, which simplifies the implementation.
>
> Proposed API:
>
> ApexCli> savepoint <appId> <folderToSaveTheState>
> ApexCli> launch -savepoint <folderWithTheState>
>
> first prototype:
> https://github.com/sandeshh/apex-core/commit/8ec7e837318c2b33289251cda78ece0024a3f895
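
For illustration only, a rough sketch of what the savepoint command could write
into <folderToSaveTheState>, following the metadata-file idea described at the
top of this mail rather than what the prototype does internally (SavepointWriter,
writeSavepoint and savepoint.meta are hypothetical names):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.Map;

public class SavepointWriter {

  // Record each physical operator's committed checkpoint id and serialize the
  // metadata into the target folder, so that a later "launch -savepoint <folder>"
  // can ask the storage agent for exactly these (operator id, window id) pairs.
  public static void writeSavepoint(String name,
                                    Map<Integer, Long> committedCheckpoints,
                                    String targetFolder) throws IOException {
    SavepointMetadata metadata = new SavepointMetadata(name);
    for (Map.Entry<Integer, Long> e : committedCheckpoints.entrySet()) {
      metadata.addOperatorCheckpoint(e.getKey(), e.getValue());
    }
    try (ObjectOutputStream out = new ObjectOutputStream(
        new FileOutputStream(targetFolder + "/savepoint.meta"))) {
      out.writeObject(metadata);
    }
  }
}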
>
> Thanks
>
> On Thu, Aug 4, 2016 at 11:54 AM Amol Kekre <a...@datatorrent.com> wrote:
>
>> Hmm! Actually it may be a good debugging tool too. Keep the named
>> checkpoints around. The feature is to keep checkpoints around, which can be
>> done by providing an option to not delete checkpoints, but naming them
>> makes it more operational. Send a command from the cli -> get checkpoint -> know
>> it is the one you need because the file name has the string you sent with the
>> command -> debug. This is different from querying a state, as this gives the
>> entire app checkpoint to debug with.
>>
>> Thks
>> Amol
>>
>>
>> On Thu, Aug 4, 2016 at 11:41 AM, Venkatesh Kottapalli <
>> venkat...@datatorrent.com> wrote:
>>
>> > + 1 for the idea.
>> >
>> > It might be helpful to developers as well, when dealing with a variety of
>> > data in large volumes, if this lets them resume from the checkpointed state
>> > rather than rerunning the application altogether in case of issues.
>> >
>> > I have seen cases where an application runs for more than 10 hours and
>> > some partitions fail because of the variety of data it is dealing with. In
>> > such cases the application has to be restarted, and a feature of this kind
>> > will be helpful to developers.
>> >
>> > The ease of enabling/disabling this feature when running the app will also
>> > be important.
>> >
>> > -Venkatesh.
>> >
>> >
>> > > On Aug 4, 2016, at 10:29 AM, Amol Kekre <a...@datatorrent.com> wrote:
>> > >
>> > > We had a user who wanted roll-back and restart for audit purposes. At
>> > > that time we did not have timed windows. Named checkpoints would have
>> > > helped a little bit.
>> > >
>> > > Problem statement: Auditors ask for a rerun of yesterday's computations
>> > > for verification. Assume that these computations depend on previous state
>> > > (i.e. data from the day before yesterday).
>> > >
>> > > Solution
>> > > 1. Have named checkpoints at 12 in the night (an input adapter triggers
>> > > it) every day
>> > > 2. The app spools raw logs into hdfs along with window ids and event times
>> > > 3. The re-run is a separate app that starts off from a named checkpoint
>> > > (12 night yesterday)
>> > >
>> > > Technically the solution will not be as simple, and the "new audit app"
>> > > will need a lot of other checks (dedups, drop events not in yesterday's
>> > > window, wait for late arrivals, ...), but named checkpoints help.
>> > >
>> > > I do agree with Pramod that replay within the same running app is not
>> > > viable within a data-in-motion architecture. But it helps somewhat in a
>> > > new audit app. Named checkpoints help data-in-motion architectures handle
>> > > batch apps better. In the above case, the spooling in #2 done with event
>> > > time stamp + state suffices. The state part comes from the named
>> > > checkpoint.
>> > >
>> > > Thks,
>> > > Amol
>> > >
>> > >
>> > >
>> > >
>> > > On Thu, Aug 4, 2016 at 10:12 AM, Sanjay Pujare <san...@datatorrent.com
>> >
>> > > wrote:
>> > >
>> > >> I agree. A specific use-case will be useful to support this feature. Also,
>> > >> the ability to replay from the named checkpoint will be limited because of
>> > >> various factors, won't it?
>> > >>
>> > >> On 8/4/16, 9:00 AM, "Pramod Immaneni" <pra...@datatorrent.com> wrote:
>> > >>
>> > >>    There is a problem here: keeping old checkpoints and recovering from
>> > >>    them means preserving the old input data along with the state. This is
>> > >>    more than the mechanism of actually creating named checkpoints; it means
>> > >>    having the ability for operators to move forward (a.k.a. getting
>> > >>    committed, and dropping committed states and buffer data) while still
>> > >>    having the ability to replay from that point from the input source, and
>> > >>    providing a way for operators (at first look, input operators) to
>> > >>    distinguish that. Why would someone need this with idempotent
>> > >>    processing? Is there a specific use case you are looking at? Suppose we
>> > >>    do go ahead with this; for the mechanism, I would be in favor of reusing
>> > >>    the existing tuple.
>> > >>
>> > >>    On Thu, Aug 4, 2016 at 8:44 AM, Vlad Rozov <
>> v.ro...@datatorrent.com>
>> > >> wrote:
>> > >>
>> > >>> +1 for the feature. At first look I am more in favor of reusing
>> > >> existing
>> > >>> control tuple.
>> > >>>
>> > >>> Thank you,
>> > >>>
>> > >>> Vlad
>> > >>>
>> > >>>
>> > >>> On 8/4/16 08:17, Sandesh Hegde wrote:
>> > >>>
>> > >>>> @Chinmay
>> > >>>> We can enhance the existing checkpoint tuple, but that one is used more
>> > >>>> frequently than this feature, so why burden the Checkpoint tuple with an
>> > >>>> extra field?
>> > >>>>
>> > >>>> @Aniruddha
>> > >>>> It is better to leave the scheduling to the users; they can use any tool
>> > >>>> that they are already familiar with.
>> > >>>>
>> > >>>> On Thu, Aug 4, 2016 at 7:40 AM Aniruddha Thombare <
>> > >>>> anirud...@datatorrent.com>
>> > >>>> wrote:
>> > >>>>
>> > >>>>> +1 on the idea, it would be awesome to have.
>> > >>>>>
>> > >>>>> Question: Can we further develop this brilliant idea into scheduled
>> > >>>>> checkpoints (to save as dynamically named checkpoints)? This would be
>> > >>>>> along the lines of logrotate / general backup strategies.
>> > >>>>>
>> > >>>>>
>> > >>>>> Thanks,
>> > >>>>>
>> > >>>>> A
>> > >>>>>
>> > >>>>> _____________________________________
>> > >>>>> Sent with difficulty, I mean handheld ;)
>> > >>>>> On 4 Aug 2016 8:03 pm, "Munagala Ramanath" <r...@datatorrent.com>
>> > >> wrote:
>> > >>>>>
>> > >>>>> +1
>> > >>>>>>
>> > >>>>>> Ram
>> > >>>>>>
>> > >>>>>> On Thu, Aug 4, 2016 at 12:10 AM, Sandesh Hegde <sand...@datatorrent.com>
>> > >>>>>> wrote:
>> > >>>>>>
>> > >>>>>>> Hello Team,
>> > >>>>>>>
>> > >>>>>>> This thread is to discuss the Named Checkpoint feature for Apex.
>> > >>>>>>> (https://issues.apache.org/jira/browse/APEXCORE-498)
>> > >>>>>>>
>> > >>>>>>> Named checkpoints allow the following workflow:
>> > >>>>>>>
>> > >>>>>>> 1. Users can trigger a checkpoint and give it a name
>> > >>>>>>> 2. Relaunch the application from the named checkpoint
>> > >>>>>>> 3. These checkpoints survive the "purge of old checkpoints"
>> > >>>>>>>
>> > >>>>>>> The current idea is to add a new control tuple, NamedCheckPointTuple,
>> > >>>>>>> which contains the user specified name; it traverses the DAG and along
>> > >>>>>>> the way the necessary actions are taken.
>> > >>>>>>>
>> > >>>>>>> Please let me know your thoughts on this.
>> > >>>>>>>
>> > >>>>>>> Thanks
>> > >>>>>>>
>> > >>>>>>>
>> > >>>
>> > >>
>> > >>
>> > >>
>> > >>
>> >
>> >
>>