Re: checkpoint statistics

2016-09-26 Thread Venkatesh Kottapalli
+1 for this feature. The size and time to checkpoint the state at operator 
level will help in tuning and understanding the overheads if any.


-Venkatesh.

> On Sep 25, 2016, at 10:56 PM, Chinmay Kolhatkar wrote:
> 
> +1. very useful feature. We should also provide doc on how to use that
> information for tuning.
> 
> On Sun, Sep 25, 2016 at 11:27 PM, Thomas Weise wrote:
> 
>> +1 very useful during tuning and ongoing monitoring for cost of
>> checkpointing (both, serialization and io). Can also be used to identify
>> skew.
>> 
>> --
>> sent from mobile
>> On Sep 25, 2016 9:10 AM, "Munagala Ramanath" wrote:
>> 
>>> We've seen cases where operator state continues to grow without bound
>>> either because the developer was unaware of the importance of keeping
>>> state small or because of some anomaly downstream. In such cases, the
>>> operators could get killed with an OOM exception because these
>>> checkpoints are building up in memory faster than they can be written
>>> to disk.
>>> 
>>> These stats may be useful in such cases to identify the root cause of
>>> failure.
>>> 
>>> Ram
>>> 
>>> On Sun, Sep 25, 2016 at 7:39 AM, Sandesh Hegde wrote:
>>> 
>>>> Say it takes x MB size and y seconds to do the checkpoint. What does the
>>>> user do with that information?
>>>> 
>>>> On Sun, Sep 25, 2016, 6:51 AM Tushar Gosavi wrote:
>>>> 
>>>>> +1
>>>>> 
>>>>> -Tushar
>>>>> 
>>>>> On Sun, Sep 25, 2016, 8:54 AM Sanjay Pujare wrote:
>>>>> 
>>>>>> +1
>>>>>> 
>>>>>> Sanjay
>>>>>> 
>>>>>> On Sun, Sep 25, 2016 at 7:06 AM, Devendra Tagare <devend...@datatorrent.com> wrote:
>>>>>> 
>>>>>>> +1
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Dev
>>>>>>> 
>>>>>>> On Sep 25, 2016 1:17 AM, "Pramod Immaneni" <pra...@datatorrent.com> wrote:
>>>>>>> 
>>>>>>>> +1
>>>>>>>> 
>>>>>>>>> On Sep 24, 2016, at 10:01 AM, Vlad Rozov <v.ro...@datatorrent.com> wrote:
>>>>>>>>> 
>>>>>>>>> IMO, it may be useful to provide checkpoint statistics for example,
>>>>>>>>> total size of checkpoint for particular window or average size of
>>>>>>>>> checkpoints for a particular operator. Also, how long it takes to
>>>>>>>>> write checkpoints to storage.
>>>>>>>>> 
>>>>>>>>> Thank you,
>>>>>>>>> 
>>>>>>>>> Vlad



Re: checkpoint statistics

2016-09-25 Thread Thomas Weise
+1 very useful during tuning and ongoing monitoring for cost of
checkpointing (both, serialization and io). Can also be used to identify
skew.

--
sent from mobile
On Sep 25, 2016 9:10 AM, "Munagala Ramanath" wrote:

> We've seen cases where operator state continues to grow without bound
> either because the developer was unaware of the importance of keeping
> state small or because of some anomaly downstream. In such cases, the
> operators could get killed with an OOM exception because these
> checkpoints are building up in memory faster than they can be written
> to disk.
>
> These stats may be useful in such cases to identify the root cause of
> failure.
>
> Ram
>
> On Sun, Sep 25, 2016 at 7:39 AM, Sandesh Hegde 
> wrote:
>
> > Say it takes x MB size and y seconds to do the checkpoint. What does the
> > user do with that information?
> >
> > On Sun, Sep 25, 2016, 6:51 AM Tushar Gosavi 
> > wrote:
> >
> > > +1
> > >
> > > -Tushar
> > >
> > > On Sun, Sep 25, 2016, 8:54 AM Sanjay Pujare 
> > > wrote:
> > >
> > > > +1
> > > >
> > > > Sanjay
> > > >
> > > >
> > > > On Sun, Sep 25, 2016 at 7:06 AM, Devendra Tagare <
> > > > devend...@datatorrent.com>
> > > > wrote:
> > > >
> > > > > +1
> > > > >
> > > > > Thanks,
> > > > > Dev
> > > > >
> > > > > > On Sep 25, 2016 1:17 AM, "Pramod Immaneni" wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > > > On Sep 24, 2016, at 10:01 AM, Vlad Rozov <v.ro...@datatorrent.com> wrote:
> > > > > > > >
> > > > > > > > IMO, it may be useful to provide checkpoint statistics for example,
> > > > > > > > total size of checkpoint for particular window or average size of
> > > > > > > > checkpoints for a particular operator. Also, how long it takes to
> > > > > > > > write checkpoints to storage.
> > > > > > >
> > > > > > > Thank you,
> > > > > > >
> > > > > > > Vlad
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: checkpoint statistics

2016-09-25 Thread Munagala Ramanath
We've seen cases where operator state continues to grow without bound
either because the developer was unaware of the importance of keeping
state small or because of some anomaly downstream. In such cases, the
operators could get killed with an OOM exception because these
checkpoints are building up in memory faster than they can be written
to disk.

These stats may be useful in such cases to identify the root cause of
failure.

Ram

On Sun, Sep 25, 2016 at 7:39 AM, Sandesh Hegde 
wrote:

> Say it takes x MB size and y seconds to do the checkpoint. What does the
> user do with that information?
>
> On Sun, Sep 25, 2016, 6:51 AM Tushar Gosavi 
> wrote:
>
> > +1
> >
> > -Tushar
> >
> > On Sun, Sep 25, 2016, 8:54 AM Sanjay Pujare 
> > wrote:
> >
> > > +1
> > >
> > > Sanjay
> > >
> > >
> > > On Sun, Sep 25, 2016 at 7:06 AM, Devendra Tagare <
> > > devend...@datatorrent.com>
> > > wrote:
> > >
> > > > +1
> > > >
> > > > Thanks,
> > > > Dev
> > > >
> > > > > On Sep 25, 2016 1:17 AM, "Pramod Immaneni" wrote:
> > > >
> > > > > +1
> > > > >
> > > > > > > On Sep 24, 2016, at 10:01 AM, Vlad Rozov <v.ro...@datatorrent.com> wrote:
> > > > > > >
> > > > > > > IMO, it may be useful to provide checkpoint statistics for example,
> > > > > > > total size of checkpoint for particular window or average size of
> > > > > > > checkpoints for a particular operator. Also, how long it takes to
> > > > > > > write checkpoints to storage.
> > > > > >
> > > > > > Thank you,
> > > > > >
> > > > > > Vlad
> > > > >
> > > >
> > >
> >
>


Re: checkpoint statistics

2016-09-25 Thread Sandesh Hegde
Say it takes x MB size and y seconds to do the checkpoint. What does the
user do with that information?

On Sun, Sep 25, 2016, 6:51 AM Tushar Gosavi  wrote:

> +1
>
> -Tushar
>
> On Sun, Sep 25, 2016, 8:54 AM Sanjay Pujare 
> wrote:
>
> > +1
> >
> > Sanjay
> >
> >
> > On Sun, Sep 25, 2016 at 7:06 AM, Devendra Tagare <
> > devend...@datatorrent.com>
> > wrote:
> >
> > > +1
> > >
> > > Thanks,
> > > Dev
> > >
> > > > On Sep 25, 2016 1:17 AM, "Pramod Immaneni" wrote:
> > >
> > > > +1
> > > >
> > > > > > On Sep 24, 2016, at 10:01 AM, Vlad Rozov wrote:
> > > > > >
> > > > > > IMO, it may be useful to provide checkpoint statistics for example,
> > > > > > total size of checkpoint for particular window or average size of
> > > > > > checkpoints for a particular operator. Also, how long it takes to
> > > > > > write checkpoints to storage.
> > > > >
> > > > > Thank you,
> > > > >
> > > > > Vlad
> > > >
> > >
> >
>


Re: checkpoint statistics

2016-09-25 Thread Tushar Gosavi
+1

-Tushar

On Sun, Sep 25, 2016, 8:54 AM Sanjay Pujare  wrote:

> +1
>
> Sanjay
>
>
> On Sun, Sep 25, 2016 at 7:06 AM, Devendra Tagare <
> devend...@datatorrent.com>
> wrote:
>
> > +1
> >
> > Thanks,
> > Dev
> >
> > > On Sep 25, 2016 1:17 AM, "Pramod Immaneni" wrote:
> >
> > > +1
> > >
> > > > On Sep 24, 2016, at 10:01 AM, Vlad Rozov wrote:
> > > >
> > > > IMO, it may be useful to provide checkpoint statistics for example,
> > > > total size of checkpoint for particular window or average size of
> > > > checkpoints for a particular operator. Also, how long it takes to
> > > > write checkpoints to storage.
> > > >
> > > > Thank you,
> > > >
> > > > Vlad
> > >
> >
>


Re: checkpoint statistics

2016-09-24 Thread Amol Kekre
+1. Very important stat for deciding a crucial question -> "Whether to
checkpoint an operator?". It affects SLA, design, ...

Thks
Amol


On Sat, Sep 24, 2016 at 10:01 AM, Vlad Rozov 
wrote:

> IMO, it may be useful to provide checkpoint statistics for example, total
> size of checkpoint for particular window or average size of checkpoints for
> a particular operator. Also, how long it takes to write checkpoints to
> storage.
>
> Thank you,
>
> Vlad
>
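
The statistics proposed in this thread (total checkpoint size per window, average checkpoint size per operator, and how long checkpoints take to write) amount to a simple aggregation over per-checkpoint measurements. The following is only an illustrative sketch under assumptions, not the platform's API: `CheckpointRecord`, its fields, and both helper functions are hypothetical names, and the sample numbers are made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointRecord:
    """One checkpoint of one operator (hypothetical record layout)."""
    operator: str
    window_id: int
    size_bytes: int
    write_seconds: float


def total_size_by_window(records):
    """Sum checkpoint sizes across all operators for each window."""
    totals = defaultdict(int)
    for r in records:
        totals[r.window_id] += r.size_bytes
    return dict(totals)


def per_operator_stats(records):
    """Map each operator to (average size in bytes, average write time in seconds)."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r.operator].append(r)
    return {
        op: (
            sum(r.size_bytes for r in rs) / len(rs),
            sum(r.write_seconds for r in rs) / len(rs),
        )
        for op, rs in grouped.items()
    }


# Made-up sample measurements, for illustration only.
records = [
    CheckpointRecord("dedup", 1, 100, 0.5),
    CheckpointRecord("dedup", 2, 300, 1.5),
    CheckpointRecord("parser", 1, 50, 0.25),
]

print(total_size_by_window(records))
print(per_operator_stats(records))
```

Comparing the per-operator averages across partitions of the same operator is one way such numbers could surface skew, and a per-operator size that keeps rising from window to window would point at the unbounded state growth described earlier in the thread.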