> On June 12, 2015, 6:43 p.m., Jie Yu wrote:
> > src/slave/slave.cpp, lines 4383-4400
> > <https://reviews.apache.org/r/34720/diff/4/?file=983235#file983235line4383>
> >
> > First of all, we don't need to check if 'executor' is NULL or not here.
> > If you looked at the callers, it's guaranteed that 'executor' will be
> > non-NULL. So you need a CHECK here.
> >
> > Also, I thought about what the correct logic should be in this function
> > (see below). Let me know your thoughts. (We have too much tech debt in
> > this function, so we'd better fix this asap, instead of waiting further.)
> >
> > ```
> > TaskState state;
> > TaskStatus::Reason reason;
> > string message;
> >
> > // Determine the state and reason for the status update.
> > if (termination.isReady() && termination.get().has_reason()) {
> >   // This is the case where the resource limit has been
> >   // reached and the container destroyed itself.
> >   state = TASK_FAILED;
> >   reason = termination.get().reason();
> > } else if (executor->reason.isSome()) {
> >   // This is the case where the slave initiates the
> >   // termination of the container.
> >
> >   // NOTE: For command executors, previously we sent TASK_FAILED
> >   // and REASON_COMMAND_EXECUTOR_FAILED when the command executor
> >   // failed to launch (e.g., invalid executor path).
> >   if (executor->isCommandExecutor() &&
> >       executor->reason.get() == REASON_EXECUTOR_REGISTRATION_TIMEOUT) {
> >     state = TASK_FAILED;
> >     reason = REASON_COMMAND_EXECUTOR_FAILED;
> >   } else {
> >     state = TASK_LOST;
> >     reason = executor->reason.get();
> >   }
> > } else {
> >   state = TASK_LOST;
> >   reason = TaskStatus::REASON_EXECUTOR_TERMINATED;
> > }
> >
> > // Determine the message for the status update.
> > // TODO(nnielsen): Consider adding a message from the slave
> > // as well, and concatenating the messages.
> > if (termination.isReady()) {
> >   message = termination.get().message();
> > } else {
> >   // This is the case where the containerizer fails
> >   // to destroy the container.
> >   message = "Abnormal executor termination";
> > }
> >
> > statusUpdate(...);
> > ```
>
> Jie Yu wrote:
>     Forgot to mention that you need to set executor->reason accordingly when
>     the slave initiates a destroy of the container (e.g., containerizer->update
>     fails, executor registration timeout, etc.).
>
> Vinod Kone wrote:
>     Changing the command task state from TASK_FAILED to TASK_LOST is an API
>     change! It should have a deprecation cycle, with a heads-up given on the
>     dev list. I would recommend doing:
>
>     ```
>     // Determine the state and reason for the status update.
>     if (termination.isReady() && termination.get().has_reason()) {
>       // This is the case where the resource limit has been
>       // reached and the container destroyed itself.
>       state = TASK_FAILED;
>       reason = termination.get().reason();
>     } else if (executor->isCommandExecutor()) {
>       // TODO(jieyu): .....
>       state = TASK_FAILED;
>       reason = REASON_COMMAND_EXECUTOR_FAILED;
>     } else if (executor->reason.isSome()) {
>       // This is the case where the slave initiates the
>       // termination of the container.
>       state = TASK_LOST;
>       reason = executor->reason.get();
>     } else {
>       state = TASK_LOST;
>       reason = TaskStatus::REASON_EXECUTOR_TERMINATED;
>     }
>     ```
OK, that's exactly the reason why I want this to be fixed asap (otherwise,
we'll have to deal with another API breaking change).

Vinod, your code does not work for niq. For example, if a command executor
is killed due to a QoS violation, we will send TASK_FAILED and
REASON_COMMAND_EXECUTOR_FAILED. That's not correct.

- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/#review87734
-----------------------------------------------------------


On June 11, 2015, 10:58 p.m., Niklas Nielsen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34720/
> -----------------------------------------------------------
> 
> (Updated June 11, 2015, 10:58 p.m.)
> 
> 
> Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and
> Vinod Kone.
> 
> 
> Bugs: MESOS-2653
>     https://issues.apache.org/jira/browse/MESOS-2653
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> See summary
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 0df1b55791963fb4159b7ea5318d09dde4f7d8c7
>   src/slave/slave.cpp b523c2fce50e56f4f94d55a9488f49c53452e4d4
> 
> Diff: https://reviews.apache.org/r/34720/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Niklas Nielsen
> 