[
https://issues.apache.org/jira/browse/MESOS-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Greg Mann updated MESOS-1648:
-----------------------------
Assignee: (was: Greg Mann)
> Add a --pidfile option to master and agent binaries.
> ----------------------------------------------------
>
> Key: MESOS-1648
> URL: https://issues.apache.org/jira/browse/MESOS-1648
> Project: Mesos
> Issue Type: Improvement
> Components: master, slave
> Reporter: Tobias Weingartner
> Labels: newbie, twitter
>
> Right now we use a number of wrapper scripts to try and keep up a
> {{/var/run/mesos/mesos-slave.pid}} in order to be able to monitor the
> process. This has proven to be somewhat fragile due to the lack of locking
> and the possibility of races and stale data.
> By adding a {{--pidfile}}, we can obtain a lock on the file to prevent
> multiple binaries from starting, and to enable the tooling to validate that
> the lock is held before doing any signaling. We can also do a best effort
> unlink in the signal handler upon termination:
> {code}
> // Get exclusive access to the file.
> fd = open(O_CREAT ...)
> flock(fd, LOCK_EX | LOCK_NB)
> if not locked, abort
> ftruncate(fd, 0)
> // Write the pid.
> write(fd, "<pid>")
> // Inside signal handler..
> unlink(pidfile)
> {code}
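A minimal sketch of the open / flock / ftruncate / write sequence above, written in Python rather than the C++ a Mesos patch would use. The function name `acquire_pidfile` and the file mode `0o644` are assumptions made for illustration, and the best-effort unlink in the signal handler is left out.

```python
import fcntl
import os


def acquire_pidfile(path):
    """Lock `path` exclusively, write our pid into it, and return the open fd.

    Raises BlockingIOError if the lock is already held elsewhere.
    """
    # O_CREAT without O_TRUNC: never clobber the file before we own the lock.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the lock.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise
    # We hold the lock, so it is now safe to discard any stale contents.
    os.ftruncate(fd, 0)
    os.write(fd, b"%d\n" % os.getpid())
    # The caller must keep this fd open: closing it releases the lock.
    return fd
```

The truncate happens only after the lock is held, so a second instance that loses the race never wipes the pid written by the first.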
> Digging around, looks like the open, ftruncate, write pattern is pretty
> common:
> http://man7.org/tlpi/code/online/diff/filelock/create_pid_file.c.html
> The tooling around it could verify that the file is locked by the pid inside
> it before taking any action (like signaling):
> *Case 1*: If the file does not exist or is not locked, then assume nothing is
> running. It's possible for something to be running and about to grab the
> lock, but we'll eventually read it correctly and converge on a single
> instance started correctly.
> *Case 2*: If the file is locked, and the pid doesn't match, then assume it is
> running but not as the pid in the file (.. yet). Treat this the same as (1),
> assume it's not running, and the next attempts to start will eventually
> converge on a single instance running.
> *Case 3*: If the file is locked, and the pid matches the locker process, then
> assume it is running as that pid. Note that it's still possible that in
> between matching the pid and taking an action (e.g. kill), the pid may become
> stale, but the recycling pattern of pids makes it unlikely to be re-used
> unless there is a large delay.
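A rough sketch of how a monitoring tool might apply the cases above, assuming the daemon holds an `flock` lock as in the description. The name `read_locked_pid` is hypothetical. Note two limitations: `flock` does not report which process holds the lock, so this sketch cannot tell Case 2 from Case 3 (that would need `fcntl`-style locks queried with `F_GETLK`), and the probe briefly takes the lock itself, so a daemon starting at that instant could fail to acquire it.

```python
import fcntl
import os


def read_locked_pid(path):
    """Return the pid recorded in `path` if the file is locked, else None."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except FileNotFoundError:
        return None  # Case 1: no file, assume nothing is running.
    try:
        try:
            # Probe: if we can take the lock, nobody else was holding it.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # Locked by someone else: report the recorded pid, if written yet.
            data = os.read(fd, 32).strip()
            return int(data) if data.isdigit() else None
        return None  # Case 1: the file exists but is not locked.
    finally:
        os.close(fd)  # Closing the fd also drops any lock the probe took.
```

A caller would signal only when a pid is returned, and would treat `None` as "not running", in line with the convergence argument in Cases 1 and 2.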
> It seems like some tools already do this signal wrapping (note the comment
> about fcntl and note the race from (3) in the BUGS section):
> http://manpages.ubuntu.com/manpages/natty/man8/ovs-kill.8.html