-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71343/#review217419
-----------------------------------------------------------

Fix it, then Ship it!


src/slave/slave.cpp
Lines 10775-10776 (patched)
<https://reviews.apache.org/r/71343/#comment304700>

It does not seem likely to crash. But to be safe, could we consider relaxing these CHECKs? E.g., just log a warning and return?


- Gilbert Song


On Aug. 21, 2019, 10:53 a.m., Andrei Budnik wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71343/
> -----------------------------------------------------------
>
> (Updated Aug. 21, 2019, 10:53 a.m.)
>
>
> Review request for mesos, Gilbert Song, Greg Mann, and Qian Zhang.
>
>
> Bugs: MESOS-9887
>     https://issues.apache.org/jira/browse/MESOS-9887
>
>
> Repository: mesos
>
>
> Description
> -------
>
> Previously, the Mesos agent could send a TASK_FAILED status update on
> executor termination while processing of a TASK_FINISHED status update
> was still in progress. Processing of task status updates involves sending
> requests to the containerizer, which might finish processing these
> requests out of order, e.g. `MesosContainerizer::status`. Also,
> the agent does not overwrite the status of a terminal status update once
> it is stored in `terminatedTasks`. Hence, there was a race condition
> between two terminal status updates.
>
> Note that V1 executors are not affected by this problem because they
> wait for the agent to acknowledge the terminal status update
> before terminating.
>
> This patch introduces a new data structure, `pendingStatusUpdates`,
> which holds a list of status updates that are being processed. This
> data structure allows the agent to validate the order in which status
> updates are processed.
>
>
> Diffs
> -----
>
>   src/slave/slave.hpp a17bbee13cb8291ad694f1520b613764b57b046b
>   src/slave/slave.cpp 1d0ec9d2428c3ffa28ad3e960b74f171013cf0c2
>
>
> Diff: https://reviews.apache.org/r/71343/diff/2/
>
>
> Testing
> -------
>
> 1. Manual testing described in MESOS-9887
> 2. Internal CI
>
>
> Thanks,
>
> Andrei Budnik
>
>
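
To illustrate the review comment on the CHECKs, here is a minimal, self-contained sketch of the two options. It is not the code from the patch: `PendingUpdates`, `removePendingStrict`, and `removePendingRelaxed` are hypothetical names, the map of task ID to update UUIDs is only an assumed stand-in for `pendingStatusUpdates`, and `assert()` plus `std::cerr` stand in for `CHECK` and `LOG(WARNING)`.

```cpp
#include <algorithm>
#include <cassert>
#include <iostream>
#include <list>
#include <map>
#include <string>

// Hypothetical stand-in for the agent's bookkeeping: for each task ID, the
// UUIDs of the status updates that are still being processed, in arrival order.
using PendingUpdates = std::map<std::string, std::list<std::string>>;

// Strict variant: a missing entry is treated as a broken invariant.
// assert() only approximates CHECK here; unlike CHECK, it is compiled out
// when NDEBUG is defined.
void removePendingStrict(
    PendingUpdates& pending,
    const std::string& taskId,
    const std::string& uuid)
{
  auto it = pending.find(taskId);
  assert(it != pending.end());

  auto& updates = it->second;
  auto pos = std::find(updates.begin(), updates.end(), uuid);
  assert(pos != updates.end());

  updates.erase(pos);
  if (updates.empty()) {
    pending.erase(it);
  }
}

// Relaxed variant: log a warning and return instead of aborting the process.
// Returns true if the update was found and removed.
bool removePendingRelaxed(
    PendingUpdates& pending,
    const std::string& taskId,
    const std::string& uuid)
{
  auto it = pending.find(taskId);
  if (it == pending.end()) {
    std::cerr << "WARNING: no pending status updates for task '"
              << taskId << "'" << std::endl;
    return false;
  }

  auto& updates = it->second;
  auto pos = std::find(updates.begin(), updates.end(), uuid);
  if (pos == updates.end()) {
    std::cerr << "WARNING: status update '" << uuid
              << "' is not pending for task '" << taskId << "'" << std::endl;
    return false;
  }

  updates.erase(pos);
  if (updates.empty()) {
    pending.erase(it);
  }
  return true;
}
```

The trade-off is the usual one: the strict variant surfaces a violated invariant immediately at the cost of taking the agent down, while the relaxed variant keeps the agent running and leaves a trace in the log.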