Re: checkpoint statistics
+1 for this feature. The size of the checkpointed state and the time taken to checkpoint it, reported at the operator level, will help in tuning and in understanding the overheads, if any.

-Venkatesh.

> On Sep 25, 2016, at 10:56 PM, Chinmay Kolhatkar wrote:
>
> +1. very useful feature. We should also provide doc on how to use that
> information for tuning.
>
> On Sun, Sep 25, 2016 at 11:27 PM, Thomas Weise wrote:
>
>> +1 very useful during tuning and ongoing monitoring for cost of
>> checkpointing (both, serialization and io). Can also be used to identify
>> skew.
>>
>> --
>> sent from mobile
Re: checkpoint statistics
+1. Very useful during tuning and for ongoing monitoring of the cost of checkpointing (both serialization and I/O). Can also be used to identify skew.

--
sent from mobile

On Sep 25, 2016 9:10 AM, "Munagala Ramanath" wrote:

> We've seen cases where operator state continues to grow without bound
> either because the developer was unaware of the importance of keeping
> state small or because of some anomaly downstream. In such cases, the
> operators could get killed with an OOM exception because these
> checkpoints are building up in memory faster than they can be written
> to disk.
>
> These stats may be useful in such cases to identify the root cause of
> failure.
>
> Ram
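As an illustration of the skew point above, here is a minimal sketch of how per-partition checkpoint sizes might be compared. It assumes the sizes have already been collected into a plain mapping; the function names, the input shape, and the 2x threshold are hypothetical choices for this example and are not part of any existing API.

```python
from statistics import median


def checkpoint_skew_ratio(size_by_partition):
    """Return the largest partition's checkpoint size divided by the median size.

    `size_by_partition` is a hypothetical mapping of partition id -> bytes.
    """
    sizes = list(size_by_partition.values())
    if not sizes:
        raise ValueError("no partitions given")
    mid = median(sizes)
    if mid == 0:
        # All-zero sizes mean no skew; otherwise the ratio is unbounded.
        return float("inf") if max(sizes) > 0 else 1.0
    return max(sizes) / mid


def skewed_partitions(size_by_partition, threshold=2.0):
    """List partitions whose checkpoint is more than `threshold` times the median."""
    mid = median(size_by_partition.values())
    return sorted(p for p, s in size_by_partition.items() if s > threshold * mid)


sizes = {"p0": 100, "p1": 120, "p2": 480}
print(checkpoint_skew_ratio(sizes))
print(skewed_partitions(sizes))
```

A ratio close to 1 suggests the partitions carry similar state; a large ratio points at the partitions worth inspecting first.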
Re: checkpoint statistics
We've seen cases where operator state continues to grow without bound, either because the developer was unaware of the importance of keeping state small or because of some anomaly downstream. In such cases, the operators can get killed with an OOM exception because checkpoints build up in memory faster than they can be written to disk.

These stats may be useful in such cases to identify the root cause of the failure.

Ram

On Sun, Sep 25, 2016 at 7:39 AM, Sandesh Hegde wrote:

> Say it takes x MB size and y seconds to do the checkpoint. What does the
> user do with that information?
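One way the unbounded-growth case described above could be flagged from checkpoint sizes is sketched below. This is only a heuristic under stated assumptions: the function name, the five-checkpoint lookback, and the 2x growth factor are all hypothetical and would need tuning for a real application.

```python
def looks_unbounded(sizes_bytes, min_checkpoints=5, growth_factor=2.0):
    """Heuristic: True if the most recent checkpoints never shrink and the
    newest one is at least `growth_factor` times the oldest in the lookback.

    `sizes_bytes` is a hypothetical list of one operator's checkpoint sizes,
    ordered from oldest to newest.
    """
    if len(sizes_bytes) < min_checkpoints:
        return False  # not enough history to judge
    recent = sizes_bytes[-min_checkpoints:]
    never_shrinks = all(b >= a for a, b in zip(recent, recent[1:]))
    grew = recent[-1] > recent[0] and recent[-1] >= growth_factor * recent[0]
    return never_shrinks and grew


print(looks_unbounded([10, 20, 40, 80, 160]))      # steadily doubling state
print(looks_unbounded([100, 110, 90, 105, 100]))   # state fluctuates around a level
```

A check like this only says the state is still growing; whether that is a bug or expected behavior still depends on the operator.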
Re: checkpoint statistics
Say a checkpoint takes x MB and y seconds. What does the user do with that information?

On Sun, Sep 25, 2016, 6:51 AM Tushar Gosavi wrote:

> +1
>
> -Tushar
Re: checkpoint statistics
+1

-Tushar

On Sun, Sep 25, 2016, 8:54 AM Sanjay Pujare wrote:

> +1
>
> Sanjay
>
> On Sun, Sep 25, 2016 at 7:06 AM, Devendra Tagare <devend...@datatorrent.com> wrote:
>
> > +1
> >
> > Thanks,
> > Dev
> >
> > On Sep 25, 2016 1:17 AM, "Pramod Immaneni" wrote:
> >
> > > +1
> > >
> > > > On Sep 24, 2016, at 10:01 AM, Vlad Rozov wrote:
> > > >
> > > > IMO, it may be useful to provide checkpoint statistics for example,
> > > > total size of checkpoint for particular window or average size of
> > > > checkpoints for a particular operator. Also, how long it takes to write
> > > > checkpoints to storage.
> > > >
> > > > Thank you,
> > > >
> > > > Vlad
Re: checkpoint statistics
+1. Very important stat for deciding a crucial question: "Whether to checkpoint an operator?" It affects SLA, design, ...

Thanks,
Amol

On Sat, Sep 24, 2016 at 10:01 AM, Vlad Rozov wrote:

> IMO, it may be useful to provide checkpoint statistics, for example, the total
> size of the checkpoint for a particular window or the average size of checkpoints for
> a particular operator. Also, how long it takes to write checkpoints to
> storage.
>
> Thank you,
>
> Vlad
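To make the proposed statistics concrete, the following is a minimal sketch of how total checkpoint size per window, average checkpoint size per operator, and average write time per operator could be derived from raw per-checkpoint records. The `CheckpointRecord` type, its field names, and the helper functions are assumptions made for this example; they do not describe an existing interface.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class CheckpointRecord:
    """Hypothetical record of one operator checkpoint."""
    operator_id: str
    window_id: int
    size_bytes: int
    write_seconds: float


def total_size_per_window(records):
    """Sum checkpoint sizes across all operators for each window."""
    totals = defaultdict(int)
    for r in records:
        totals[r.window_id] += r.size_bytes
    return dict(totals)


def average_size_per_operator(records):
    """Average checkpoint size of each operator across windows."""
    sizes = defaultdict(list)
    for r in records:
        sizes[r.operator_id].append(r.size_bytes)
    return {op: mean(values) for op, values in sizes.items()}


def average_write_seconds_per_operator(records):
    """Average time each operator took to write a checkpoint to storage."""
    times = defaultdict(list)
    for r in records:
        times[r.operator_id].append(r.write_seconds)
    return {op: mean(values) for op, values in times.items()}


records = [
    CheckpointRecord("opA", 1, 100, 0.5),
    CheckpointRecord("opB", 1, 50, 0.25),
    CheckpointRecord("opA", 2, 300, 1.5),
]
print(total_size_per_window(records))
print(average_size_per_operator(records))
print(average_write_seconds_per_operator(records))
```

Keeping the raw records and deriving the aggregates on demand leaves room for other views later, such as per-window write time.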