Re: Review Request 27531: Update Master metrics to match task source and reason scheme.

Ben Mahler Mon, 09 Feb 2015 12:26:16 -0800


> On Feb. 6, 2015, 6:56 p.m., Vinod Kone wrote:
> > src/master/master.cpp, line 3601
> > <https://reviews.apache.org/r/27531/diff/7/?file=852243#file852243line3601>
> >
> >     Looks like you missed updating metrics here?
> >     
> >     Now that I think about it, it's worthwhile to consolidate metrics 
> > updates into one place to avoid misses like these in the future. How about 
> > calling incrementTasksStates() in one place, inside forward(). forward() is 
> > called by every status update generated/received by the master (exception 
> > is reconcileTasks() but it should probably use forward() too).
> >     
> >     ```
> >     void forward(
> >         const StatusUpdate& update,
> >         const UPID& acknowledgee,
> >         Framework* framework)
> >     {
> >       ....
> >     
> >       // Update terminal state metrics.
> >       if (protobuf::isTerminalState(update.status.state()) {
> >         incrementTasksStates(...);
> >         
> >         switch (update.status.state()) {
> >           case TASK_FINISHED: ++metrics->tasks_finished; break;
> >           case TASK_FAILED:   ++metrics->tasks_failed;   break;
> >           case TASK_KILLED:   ++metrics->tasks_killed;   break;
> >           case TASK_LOST:     ++metrics->tasks_lost;     break;
> >           default: break;
> >         }
> >       }
> >     }
> >     ```
> 
> Vinod Kone wrote:
>     actually, both reconcileTasks() and removeFramework() seem to be 
> exceptions. hmmm..
> 
> Dominic Hamon wrote:
>     i like this. i did a scan of uses of 'forward' originally and thought it 
> was being called in too many places to be a good candidate for when we want 
> to update the metrics. filtering on terminal states is the right way to do it 
> though.
>     
>     and yes, reconciliation is subtle. if it does use forward, we may end up 
> counting duplicates, though the reason will be 'reconciliation' so it should 
> be ok.
> 
> Dominic Hamon wrote:
>     removeFramework calls updateTask which updates terminal states.
> 
> Dominic Hamon wrote:
>     statusUpdate calls both forward() and updateTask() so that might double 
> count now.
> 
> Vinod Kone wrote:
>     i think we need to decide what these counters are actually counting. are 
> these counting the number of state transitions to terminal states or number 
> of status updates being forwarded in this state. the 'stats' metric was 
> counting the latter. but, if we are not worried about maintaing parity (we 
> already don't because we have gauges for non-terminal states in 'metrics' as 
> opposed to counters in 'stats') i think it makes sense to count the former. 
> in that case, ignore my comments here. what you have is good because 
>     1) updateTask() is called for all transitions and we update terminal 
> metrics there
>     2) no need to update terminal metrics in reconcileTasks() because they 
> are not state transitions
>     3) terminal metrics are updated in launchTasks/acceptOffers because we 
> can think of them as transitions from not-existing to TASK_LOST/TASK_ERROR
>     
>     thoughts? cc @bmahler
> 
> Dominic Hamon wrote:
>     the gauges we have for non-terminal are a change from stats. They capture 
> the current number of tasks in a given state. This is, i think, a bug and we 
> should have considered this more carefully.
>     
>     I think we really want to know how many updates we're forwarding in a 
> given state/source/reason tuple. Terminal states is only a thing now because 
> reasons only apply to terminal states. Conceptually, any state transition can 
> have a reason, so this is an artificial limitation.
>     
>     As such, I think we want the latter and to make sure we're capturing all 
> the status updates being forwarded. As such, putting the code in 'forward' is 
> the right thing to do. This means that we won't capture 'TASK_KILLED' from 
> removeFramework unless we add a special case, and we won't capture 
> reconciliation until that uses forward. Removing the metric update from 
> updateTask and putting it in forward seems like it does the right thing.


If we want to count status updates, perhaps those metrics should be added 
separately and named explicitly? e.g. `status_update_task_failed`.

Seeing `task_failed` I would imagine it counting the number of failed tasks, 
seems unintuitive for it to not match the number of failed tasks (by counting 
the number of status updates with TASK_FAILED).


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27531/#review71470
-----------------------------------------------------------


On Feb. 6, 2015, 11:22 p.m., Dominic Hamon wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27531/
> -----------------------------------------------------------
> 
> (Updated Feb. 6, 2015, 11:22 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-1830
>     https://issues.apache.org/jira/browse/MESOS-1830
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Update metrics in Master to match the source and reason split for task 
> statuses.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp c3c77f840f089c5754c764e7f150a3c1971d311f 
>   src/master/master.cpp 22fece79c6e511a1b61eb674d7f82216e7a25e00 
>   src/master/metrics.hpp ee8b5bc76336ece9192455b216187482cee29420 
>   src/master/metrics.cpp 0b3b91e744cdb76cd843edfcf34b12b842fcb06e 
>   src/slave/slave.hpp 9adee17cb94a72f0e1e139b3fd8978a9a1ff6237 
>   src/slave/slave.cpp 336e8776be656c5e6ad1b0d997f54c307f4559f8 
>   src/tests/master_authorization_tests.cpp 
> 6fd0efa444dce5203e2e755d4de0483fcaa277f8 
>   src/tests/master_slave_reconciliation_tests.cpp 
> 09742855c9fcd6bd10ab79af582b14fba2e6b609 
>   src/tests/master_tests.cpp 62ba35b9c3f999c59a95bffb01b8b82cc543a34f 
>   src/tests/mesos.hpp 83a369968ab2403fa341829ac5d11f7243095190 
>   src/tests/mesos.cpp 21a405366f56c963611324076efe775f85b9d9f7 
>   src/tests/rate_limiting_tests.cpp 8b55bffff0b7713806fc0bcc63d8a7097a8408e0 
>   src/tests/slave_tests.cpp 68a6498cd86723e571b262b2495b4b504e744428 
> 
> Diff: https://reviews.apache.org/r/27531/diff/
> 
> 
> Testing
> -------
> 
> added metric tests to master tests
> make check
> 
> 
> Thanks,
> 
> Dominic Hamon
> 
>

Re: Review Request 27531: Update Master metrics to match task source and reason scheme.

Reply via email to