We've seen cases where operator state continues to grow without bound,
either because the developer was unaware of the importance of keeping
state small or because of some anomaly downstream. In such cases, the
operators can get killed with an OOM exception because checkpoints
build up in memory faster than they can be written to disk.

In such cases, these stats may help identify the root cause of the
failure.
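
As a rough illustration of what a user could do with the numbers (a
sketch only: none of these names are Apex APIs, the record layout and
the helper functions are hypothetical, and the thresholds are assumed
values), per-checkpoint size and write time are enough to flag both
symptoms described above:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckpointStat:
    """Hypothetical record of one checkpoint for one operator."""
    window_id: int        # window at which the checkpoint was taken
    size_bytes: int       # serialized size of the operator state
    write_seconds: float  # time taken to write it to storage


def average_size(stats: List[CheckpointStat]) -> float:
    """Average checkpoint size in bytes (0.0 if there are no samples)."""
    if not stats:
        return 0.0
    return sum(s.size_bytes for s in stats) / len(stats)


def grows_without_bound(stats: List[CheckpointStat], min_run: int = 3) -> bool:
    """True if each of the last `min_run` checkpoints is larger than the
    one before it, which hints at steadily growing operator state."""
    if len(stats) < min_run + 1:
        return False
    recent = stats[-(min_run + 1):]
    return all(b.size_bytes > a.size_bytes for a, b in zip(recent, recent[1:]))


def falling_behind(stats: List[CheckpointStat], interval_seconds: float) -> bool:
    """True if the latest checkpoint took longer to write than the
    (assumed) interval between checkpoints, so they pile up in memory."""
    if not stats:
        return False
    return stats[-1].write_seconds > interval_seconds
```

An operator whose checkpoints trip either check is a candidate for the
kind of OOM failure described above, before it actually gets killed.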

Ram

On Sun, Sep 25, 2016 at 7:39 AM, Sandesh Hegde <sand...@datatorrent.com>
wrote:

> Say it takes x MB size and y seconds to do the checkpoint. What does the
> user do with that information?
>
> On Sun, Sep 25, 2016, 6:51 AM Tushar Gosavi <tus...@datatorrent.com>
> wrote:
>
> > +1
> >
> > -Tushar
> >
> > On Sun, Sep 25, 2016, 8:54 AM Sanjay Pujare <san...@datatorrent.com>
> > wrote:
> >
> > > +1
> > >
> > > Sanjay
> > >
> > >
> > > On Sun, Sep 25, 2016 at 7:06 AM, Devendra Tagare <
> > > devend...@datatorrent.com>
> > > wrote:
> > >
> > > > +1
> > > >
> > > > Thanks,
> > > > Dev
> > > >
> > > > On Sep 25, 2016 1:17 AM, "Pramod Immaneni" <pra...@datatorrent.com>
> > > wrote:
> > > >
> > > > > +1
> > > > >
> > > > > > > On Sep 24, 2016, at 10:01 AM, Vlad Rozov <v.ro...@datatorrent.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > IMO, it may be useful to provide checkpoint statistics for
> > > > > > > example, total size of checkpoint for particular window or
> > > > > > > average size of checkpoints for a particular operator. Also,
> > > > > > > how long it takes to write checkpoints to storage.
> > > > > >
> > > > > > Thank you,
> > > > > >
> > > > > > Vlad
> > > > >
> > > >
> > >
> >
>
