[
https://issues.apache.org/jira/browse/MESOS-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262184#comment-13262184
]
[email protected] commented on MESOS-110:
-----------------------------------------------------
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4462/#review7232
-----------------------------------------------------------
I've only gotten halfway through ... but there is a bunch here already. I'd
like to break this up into at least four patches. (1) The utils stuff that was
added. (2) The master changes. (3) The slave::path namespace stuff. (3) The
status update manager API + implementation (but not the slave using it yet).
And (4) the slave using each of these components, and the executor changes that
are included.
These comments are across all of those patches, but I'll make future passes on
each of those components.
src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15918>
Result is deprecated. If we can replace it with Try, please do.
src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15920>
Add that this is at the current file position of the file descriptor.
src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15919>
How is this helpful? (If this came from my code, it should be removed.)
src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15926>
I'd prefer if this did not seek to the beginning and read the file, but
rather read from the current position until the end (and have the comment say
as much).
src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15924>
Blah.
src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15923>
This looks like a bug ('offset' as the third argument?).
src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15925>
No need for the space here though.
src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15927>
Again, should be killed (only makes sense in a macro).
src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15928>
You should refactor the protobuf::read and protobuf::write to use these
versions of read and write now as well.
src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15929>
man 3 dirname (and use it please).
src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15930>
s/file_pattern/pattern
src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15932>
s/result/results
src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15936>
Kill.
src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15931>
Why not return a Try instead?
src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15935>
Why is this a hack?
src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15933>
s/p/result or s/p/path
src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15934>
Kill.
src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15949>
It is not possible that the SlaveInfo could have changed? It seems like
that needs to get passed along as well.
src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15937>
s/on/to
Also, put this after the CHECK below.
src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15943>
Should this be a 'shutdown' instead of a CHECK?
src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15939>
s/ Expected/, expected
src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15940>
s/ Received/, received
src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15941>
s/UPID/const UPID&
src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15942>
Space.
src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15944>
This relies on a side-effect of shutdown, namely that it calls exit(0). But
that means if we try and test this case in "local" mode we'll end up falling
out of this if block and executing the rest of the code in this function.
Please fix.
src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15945>
This needs a comment about why it's the case that a task will only ever be
sent from the slave to the executor once (and thus, the check is warranted).
src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15947>
Maybe we only do this in the else branch below?
src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15948>
I'd prefer to call 'shutdown' here, even if it means revisiting/refactoring
that function given my comments above.
src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15946>
I'd prefer to just call this 'tasks'.
src/launcher/launcher.cpp
<https://reviews.apache.org/r/4462/#comment15950>
Kill.
src/local/local.hpp
<https://reviews.apache.org/r/4462/#comment15951>
s/recovery/recover
src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15952>
s/slave "/slave on "
src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15953>
Ditto above.
src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15954>
I suggest just killing this LOG line and keeping the new one you added
below.
src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15955>
s/"("/" ("
src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15956>
s/"("/" ("
src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15957>
s/slave "/slave on "
Note also that anytime we are printing out the PID, we're getting the IP,
so the hostname is not strictly necessary (there a bunch of these below).
src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15958>
s/"("/" ("
src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15959>
s/"("/" ("
src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15960>
?
src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15961>
s/"("/" ("
src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15962>
s/"("/" ("
src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15963>
s/"("/" ("
src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15964>
s/"("/" ("
src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15965>
s/"("/" ("
src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15966>
s/"("/" ("
src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15967>
s/"("/" ("
src/messages/messages.proto
<https://reviews.apache.org/r/4462/#comment15969>
Newline.
src/messages/messages.proto
<https://reviews.apache.org/r/4462/#comment15968>
Newline.
src/messages/messages.proto
<https://reviews.apache.org/r/4462/#comment15970>
Seems like having the SlaveInfo would be a good thing (maybe the slave port
changed? maybe other things will be added in the future that could change?).
src/scripts/killtree.sh
<https://reviews.apache.org/r/4462/#comment15971>
How about:
echo "$(basename ${0}): '${PID}' should be a number"
src/slave/constants.hpp
<https://reviews.apache.org/r/4462/#comment15972>
This seems low ... ?
src/slave/lxc_isolation_module.cpp
<https://reviews.apache.org/r/4462/#comment15973>
Please don't use snake_case.
src/slave/lxc_isolation_module.cpp
<https://reviews.apache.org/r/4462/#comment15978>
So, this is identical to what's in process_based_isolation_module.cpp. At
the time of designing this with you my thought was that we would be writing
"different" data to disk. In particular, this might be isolation module
"specific" data. I think the right model here is really to have the isolation
module know what the meta directory is for the executor and have it's own well
defined place where it can write data. In LXC's case, we'll want to probably
write not just the pid but also the container name.
src/slave/lxc_isolation_module.cpp
<https://reviews.apache.org/r/4462/#comment15975>
So, if we are just given the directory for this executor we could just
recover the file that we expected to write (above). If we somehow need to get
this from the pid_t passed in, then we'll need to have the location for the
'cgroup' directory, from which we can look at all /path/to/cgroup/*/tasks to
see if this pid is a parent of one of those tasks. This should always work.
Again, the alternative will be to have this module actually write some of
it's own state to the meta-directory, e.g., the name of this container. This
might require calling IsolationModule::recover in the slave earlier though.
src/slave/main.cpp
<https://reviews.apache.org/r/4462/#comment15979>
I'd prefer to type on the command line:
/path/to/mesos-slave --recover
rather than:
/path/to/mesos-slave --recovery
src/slave/process_based_isolation_module.hpp
<https://reviews.apache.org/r/4462/#comment15981>
Remove newline.
src/slave/process_based_isolation_module.hpp
<https://reviews.apache.org/r/4462/#comment15982>
Why'd you move this?
src/slave/process_based_isolation_module.cpp
<https://reviews.apache.org/r/4462/#comment15983>
Not used (and it's deprecated).
src/slave/process_based_isolation_module.cpp
<https://reviews.apache.org/r/4462/#comment15984>
I'm not a fan of this factored out. It's not that much code, and I'd prefer
to see exactly what's happening here.
src/slave/process_based_isolation_module.cpp
<https://reviews.apache.org/r/4462/#comment15985>
Doesn't this happen automagically because the reaper tells us about all
child processes exiting/terminating?
src/slave/slave.hpp
<https://reviews.apache.org/r/4462/#comment15988>
I'd like to stick all of this stuff in it's own file and commit this on
it's own (with integration in Slave as applicable and tests).
src/slave/slave.hpp
<https://reviews.apache.org/r/4462/#comment15986>
s/follows/follows:
src/slave/slave.hpp
<https://reviews.apache.org/r/4462/#comment15987>
What is the framework PID? How is that different than the Executor PID
mentioned below?
src/slave/slave.hpp
<https://reviews.apache.org/r/4462/#comment15989>
Why not just get reconnect from configuration rather than passing it in
here and having another default?
src/slave/slave.hpp
<https://reviews.apache.org/r/4462/#comment15990>
No snake_case please.
src/slave/slave.hpp
<https://reviews.apache.org/r/4462/#comment15991>
Kill.
src/slave/slave.hpp
<https://reviews.apache.org/r/4462/#comment15992>
What's the difference between "work" and "work root"? Or "meta" and "meta
root"?
src/slave/slave.hpp
<https://reviews.apache.org/r/4462/#comment15993>
Makes me think it shouldn't be an instance variable or recovery needs to be
factored out into it's own process ... I think we briefly talked about this
though.
src/slave/slave.cpp
<https://reviews.apache.org/r/4462/#comment15994>
I'm afraid that even this might be too complicated. We might want a single
option 'recover', that takes a string instead of a bool, which specifies "how"
to recover. For example, --recover=reconnect might mean recover the state and
executors, where as --recover=upgrade might mean recover the state but then
kill all executors (and create/send appropriate status updates). That way, when
an operator goes to restart mesos, they are forced to specify how recovery is
done, rather than specifying one option (e.g, --recover) and forgetting the
other one and making a mistake. (Note, "reconnect" and "upgrade" might not be
the best names. Also, doing it this way might make the name 'recovery' instead
of 'recover' work ... e.g., --recovery=reconnect could be read as "the recovery
strategy is reconnect to the executors".)
src/slave/slave.cpp
<https://reviews.apache.org/r/4462/#comment15995>
Isn't the default workd_dir mentioned in the addOption above /tmp/mesos? If
so, "/tmp" should be "/tmp/mesos" here, and we probably don't want
"/tmp/mesos/mesos". Just clean this up so people understand expectations (i.e.,
should work_dir be a path including the "mesos" directory, or will we create
that ourselves).
src/slave/slave.cpp
<https://reviews.apache.org/r/4462/#comment15996>
All the more reason to merge the options.
src/slave/slave.cpp
<https://reviews.apache.org/r/4462/#comment15997>
s/Skipping/Delaying
Also, let's make this LOG(WARNING).
src/slave/slave.cpp
<https://reviews.apache.org/r/4462/#comment15999>
You should just return a value (let the compiler optimize the move), no
need to allocate on heap. In this tiny bit of code you even forget to free the
object! Also, createTask does not appear to be in this review. Let's stick it
in it's own review that gets committed ASAP.
src/slave/slave.cpp
<https://reviews.apache.org/r/4462/#comment16001>
Pass these into writeFrameworkPID, no need to make this an instance
function. More importantly, you should do this for each of these
writers/readers.
- Benjamin
On 2012-04-19 16:53:07, Vinod Kone wrote:
bq.
bq. -----------------------------------------------------------
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/4462/
bq. -----------------------------------------------------------
bq.
bq. (Updated 2012-04-19 16:53:07)
bq.
bq.
bq. Review request for mesos, Benjamin Hindman and John Sirois.
bq.
bq.
bq. Summary
bq. -------
bq.
bq. Sorry for the huge CL!
bq.
bq. Slave restarts now supports recovery!
bq. --> Non-disruptive restart means running tasks are not lost
bq. --> Re-connects with live executors
bq. --> Checkpoints and reliably sends status updates
bq. --> Ability to kill executors if the slave upgrade is incompatible with
running executors
bq.
bq.
bq. This addresses bug mesos-110.
bq. https://issues.apache.org/jira/browse/mesos-110
bq.
bq.
bq. Diffs
bq. -----
bq.
bq. src/Makefile.am d5edaa2
bq. src/common/hashset.hpp 1feb610
bq. src/common/utils.hpp 1d81e21
bq. src/exec/exec.cpp e8db407
bq. src/launcher/launcher.cpp a141b9a
bq. src/local/local.hpp 55f9eaf
bq. src/local/local.cpp affe432
bq. src/master/master.cpp 4dc9ee0
bq. src/messages/messages.proto 87e1548
bq. src/sched/sched.cpp dcadb10
bq. src/scripts/killtree.sh bceae9d
bq. src/slave/constants.hpp f0c8679
bq. src/slave/http.cpp 19c48a0
bq. src/slave/isolation_module.hpp c896908
bq. src/slave/lxc_isolation_module.hpp b7beefe
bq. src/slave/lxc_isolation_module.cpp 66a2a89
bq. src/slave/main.cpp 85cba25
bq. src/slave/process_based_isolation_module.hpp f6f9554
bq. src/slave/process_based_isolation_module.cpp 2b37d42
bq. src/slave/slave.hpp 279bc7b
bq. src/slave/slave.cpp 3358ec4
bq. src/slave/statusupdates_manager.hpp PRE-CREATION
bq. src/slave/statusupdates_manager.cpp PRE-CREATION
bq. src/tests/external_tests.cpp d1b20e4
bq. src/tests/fault_tolerance_tests.cpp 6772daf
bq. src/tests/slave_restart_tests.cpp PRE-CREATION
bq. src/tests/utils.hpp e81ec82
bq.
bq. Diff: https://reviews.apache.org/r/4462/diff
bq.
bq.
bq. Testing
bq. -------
bq.
bq. make check.
bq.
bq. Note that only the new test in tests/slave_restart_tests.cpp engages in
recovery!
bq.
bq. Recovery is disabled for old tests (though they still checkpoint relevant
info!)
bq.
bq.
bq. Thanks,
bq.
bq. Vinod
bq.
bq.
> Mesos deploys should not restart tasks
> --------------------------------------
>
> Key: MESOS-110
> URL: https://issues.apache.org/jira/browse/MESOS-110
> Project: Mesos
> Issue Type: Improvement
> Components: framework
> Reporter: Rob Benson
> Assignee: Vinod Kone
>
> Running a long-lived service on Mesos has a significant drawback right now in
> that Mesos build deploys restart your tasks. This could lead to nontrivial
> outages for services that have a high warm-up time. Basically everything
> would need a graceful restart mechanism that basically allows a
> shutdown/restart with a new version of the code.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira