> On Jan. 24, 2013, 5:22 p.m., Charles Reiss wrote:
> > src/slave/slave.cpp, line 1084
> > <https://reviews.apache.org/r/9095/diff/1/?file=251559#file251559line1084>
> >
> >     Aren't you going to be missing the last resource sample for the
> >     executor (perhaps the only one for an, e.g., crash-looping executor)?
>
> Charles Reiss wrote:
>     Okay, sorry, I really should have looked more at what archive() did
>     before writing that. However, I think you have a problem with the
>     frameworkId/executorId pairs not being unique over a window when an executor
>     gets restarted with the same ID (crash-loop scenario is the obvious case
>     where this is likely again).
>
> Ben Mahler wrote:
>     Good point, there's definitely a bug here:
>
>     -Executor 1 terminates
>     -Archive stats for Executor 1
>     -Executor 1 runs again on the same slave
>     -We collect and export resource usage to STATS for Executor 1.
>
>     Now, Executor 1 incorrectly remains archived, and while it will show up
>     in the usage.json endpoint, it will never show up again in the statistics
>     snapshot.json.
>     The fix here is when a new statistics comes in, to ensure it's not
>     archived. I'll make that fix in https://reviews.apache.org/r/9093/
>
>     Were there any other issues here?
>
> Charles Reiss wrote:
>     I didn't see anything else broken, though I didn't look very hard.
>
>     I would have preferred/expected if statistics would be separate for
>     separate executor attempts (e.g. keyed by the slave's UUID, which likely
>     requires an IsolationModule API change to support), but it's not a big deal.
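For reference, the un-archive fix discussed above could look roughly like the sketch below. This is only an illustration under assumptions: StatsStore, ExecutorKey, publish(), archive() and snapshot() are hypothetical names made up for this example, not the actual monitor or statistics code in this review or in r/9093.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical key: (frameworkId, executorId). As noted in the thread,
// this pair is NOT unique across restarts of the same executor.
using ExecutorKey = std::pair<std::string, std::string>;

// Hypothetical stand-in for the statistics store; not the Mesos API.
class StatsStore {
private:
  struct Entry {
    double cpuSeconds = 0.0;
    bool archived = false;
  };

  std::map<ExecutorKey, Entry> entries_;

public:
  // Record a usage sample. A fresh sample means the executor is running
  // again, so any earlier archival of the same key is undone here.
  void publish(const ExecutorKey& key, double cpuSeconds) {
    Entry& entry = entries_[key];
    entry.cpuSeconds = cpuSeconds;
    entry.archived = false;
  }

  // Mark a terminated executor's statistics as archived.
  void archive(const ExecutorKey& key) {
    auto it = entries_.find(key);
    if (it != entries_.end()) {
      it->second.archived = true;
    }
  }

  // Keys visible in the live snapshot; archived entries are omitted.
  std::vector<ExecutorKey> snapshot() const {
    std::vector<ExecutorKey> live;
    for (const auto& kv : entries_) {
      if (!kv.second.archived) {
        live.push_back(kv.first);
      }
    }
    return live;
  }
};
```

With this shape, the sequence "terminate, archive, run again, publish" leaves the executor visible in the snapshot again, because publish() always clears the archived flag. It still merges statistics across runs of the same frameworkId/executorId pair.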
Right, it would require an isolation module API change, at least with the way I've designed it.

I think two things are useful here:
(1) Statistics per executor run
(2) Statistics across executor runs

I've designed for (2) simply because it was easier given the current API, but I think for the webui (1) is indeed more useful. I'll think about this.


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/#review15645
-----------------------------------------------------------


On Jan. 24, 2013, 9:22 a.m., Ben Mahler wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9095/
> -----------------------------------------------------------
>
> (Updated Jan. 24, 2013, 9:22 a.m.)
>
>
> Review request for mesos, Benjamin Hindman and Vinod Kone.
>
>
> Description
> -------
>
> This wires up the archival of terminated executor stats.
>
>
> This addresses bug MESOS-324.
>     https://issues.apache.org/jira/browse/MESOS-324
>
>
> Diffs
> -----
>
>   src/slave/monitor.hpp PRE-CREATION
>   src/slave/monitor.cpp PRE-CREATION
>   src/slave/slave.cpp 9755b46f97173d6fcc9ab1fd63e0e4814b3bc018
>
> Diff: https://reviews.apache.org/r/9095/diff/
>
>
> Testing
> -------
>
> make check
>
>
> Thanks,
>
> Ben Mahler
>
>
