> On March 12, 2019, 10:56 p.m., Joseph Wu wrote:
> > src/master/master.cpp
> > Lines 11361-11385 (original), 11378-11402 (patched)
> > <https://reviews.apache.org/r/70116/diff/4/?file=2130965#file2130965line11380>
> >
> >     I'm curious what would be the proper way to handle operation
> >     cleanup/removal.
> >
> >     When an operation is transitioned into a terminal state, the master
> >     will usually `removeOperation(...)` shortly afterwards. Since we don't
> >     decrement the metrics in this case, the number of terminal operations will
> >     continue to grow. This seems like the proper behavior.
> >
> >     However, in this code, it is possible to remove an agent with
> >     non-terminal operations. This means the non-terminal metrics will never be
> >     decremented. So you can have a cluster with 0 operations, but the metric
> >     for pending operations might be non-zero.
>
> Benno Evers wrote:
>     Hm, good question. I think the only ways a slave gets removed while it
>     still has operations pending is by either being marked gone, or becoming
>     unreachable.
>
>     In both cases we already transition the counters to the correct
>     `OPERATION_GONE`/`OPERATION_UNREACHABLE` states. (although unfortunately in a
>     somewhat non-local manner, that's what https://reviews.apache.org/r/70185/ is
>     all about)
>
>     For gone operations, this should be fine. However, the problem is that
>     when an unreachable slave reregisters, we re-add all operations as new
>     operations without decrementing the `operations_unreachable` metric, since at
>     the time the `UpdateSlaveMessage` arrives the master already forgot that the
>     slave was previously unreachable.
>
>     So as far as I can see, the metrics for pending operations should always
>     be correct, but it is possible to overcount unreachable operations.
>
>     It's not clear if this can be fixed without quite far-reaching
>     refactoring in the master. So I think the best course might be to either
>     document this behaviour, or remove the `operations_unreachable` metric
>     altogether.
>
>     What do you think?
>
> Joseph Wu wrote:
>     It may be worthwhile to add a CHECK to make sure we only ever remove
>     terminal (including GONE) or UNREACHABLE operations.
>
>     ---
>
>     Per endless (possible) recounting of unreachable operations, I'd lean
>     towards overcounting & documenting how/when this is possible. Probably
>     inside the operator document that lists/describes all the metrics.
>
> Greg Mann wrote:
>     "the only ways a slave gets removed while it still has operations pending
>     is by either being marked gone, or becoming unreachable"
>
>     This isn't true - for example, an agent could send an
>     `UnregisterSlaveMessage` when it has pending operations and it will be
>     removed via `Master::_removeSlave()`. Unfortunately, agent removal in that
>     method currently does not correspond to any well-defined agent states; see
>     https://issues.apache.org/jira/browse/MESOS-9556.
>
>     I think that we need to decrement non-terminal operation states when
>     operations are removed in `_removeSlave()`.

I'm now decrementing non-terminal states (and completely removed
https://reviews.apache.org/r/70185/ while re-working the accounting).


- Benno


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/#review213632
-----------------------------------------------------------


On March 26, 2019, 5:56 p.m., Benno Evers wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70116/
> -----------------------------------------------------------
>
> (Updated March 26, 2019, 5:56 p.m.)
>
>
> Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.
>
>
> Bugs: MESOS-8241
>     https://issues.apache.org/jira/browse/MESOS-8241
>
>
> Repository: mesos
>
>
> Description
> -------
>
> This commit adds additional metrics counting the
> number of operations in each state.
>
> Unit tests are added in the subsequent commit.
>
>
> Diffs
> -----
>
>   docs/monitoring.md 54b872f579dbc68ca5f67f4cc1ba34065a09aee2
>   src/master/master.cpp 9c4a9e83da94535873d72c902835f229c4f96320
>   src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa
>   src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9
>
>
> Diff: https://reviews.apache.org/r/70116/diff/5/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Benno Evers
>
>
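
For readers following the accounting discussed in this thread, here is a
minimal, self-contained sketch of the rule the reviewers converge on: counters
for terminal states keep growing when an operation is removed, while counters
for non-terminal states are decremented so they cannot leak when an agent goes
away with operations still pending. This is only an illustration and not the
code in r/70116. `OpState`, `isTerminal`, `OperationGauges` and `removeAgent`
are hypothetical names, the state list is abbreviated, and treating GONE as
terminal follows the wording used in the thread rather than any Mesos helper.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical, abbreviated stand-in for the operation states in the thread.
enum class OpState { PENDING, FINISHED, FAILED, UNREACHABLE, GONE };

// Assumption: GONE counts as terminal, as in "terminal (including GONE)".
bool isTerminal(OpState s) {
  return s == OpState::FINISHED || s == OpState::FAILED || s == OpState::GONE;
}

// Hypothetical per-state counters; real gauges would live in the metrics code.
struct OperationGauges {
  std::map<OpState, long> count;

  // A new operation is tracked in its initial state.
  void add(OpState s) { ++count[s]; }

  // Move one operation from state `from` to state `to`.
  void transition(OpState from, OpState to) {
    assert(count[from] > 0);
    --count[from];
    ++count[to];
  }

  // Called when an operation is dropped from the bookkeeping.
  // Terminal counters are left untouched; non-terminal ones are decremented.
  void remove(OpState s) {
    if (!isTerminal(s)) {
      assert(count[s] > 0);
      --count[s];
    }
  }
};

// Removing an agent removes each of its operations, whatever their state.
void removeAgent(OperationGauges& gauges, const std::vector<OpState>& ops) {
  for (OpState s : ops) {
    gauges.remove(s);
  }
}
```

With this rule, an agent that unregisters while two operations are pending and
one has finished leaves the pending counter at zero and the finished counter
unchanged, which is the "cluster with 0 operations" case raised in the first
comment.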
