[ 
https://issues.apache.org/jira/browse/MESOS-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya reassigned MESOS-1648:
---------------------------

    Assignee:     (was: Ilya)

> Add a --pidfile option to master and agent binaries.
> ----------------------------------------------------
>
>                 Key: MESOS-1648
>                 URL: https://issues.apache.org/jira/browse/MESOS-1648
>             Project: Mesos
>          Issue Type: Improvement
>          Components: agent, master
>            Reporter: Tobias Weingartner
>            Priority: Major
>              Labels: newbie, twitter
>
> Right now we use a number of wrapper scripts to try to maintain a 
> {{/var/run/mesos/mesos-slave.pid}} file so that we can monitor the 
> process.  This has proven to be somewhat fragile due to the lack of locking 
> and the possibility of races and stale data.
> By adding a {{--pidfile}} option, we can obtain a lock on the file to 
> prevent multiple binaries from starting, and to enable the tooling to 
> validate that the lock is held before doing any signaling. We can also do a 
> best-effort unlink in the signal handler upon termination:
> {code}
> // Get exclusive access to the file.
> fd = open(O_CREAT ...)
> // Non-blocking, so a lock held by another instance fails immediately.
> flock(fd, LOCK_EX | LOCK_NB)
> if not locked, abort
> ftruncate(fd, 0)
> // Write the pid.
> write(fd, "<pid>")
> // Inside signal handler..
> unlink(pidfile)
> {code}
> Digging around, it looks like the open, ftruncate, write pattern is pretty 
> common:
> http://man7.org/tlpi/code/online/diff/filelock/create_pid_file.c.html
> The tooling around it could verify that the file is locked by the pid 
> inside it, before taking any action (like signaling):
> *Case 1*: If the file does not exist or is not locked, then assume nothing is 
> running. It's possible for something to be running and about to grab the 
> lock, but we'll eventually read it correctly and converge on a single 
> instance started correctly.
> *Case 2*: If the file is locked, and the pid doesn't match, then assume it is 
> running but not as the pid in the file (.. yet). Treat this the same as (1), 
> assume it's not running, and the next attempts to start will eventually 
> converge on a single instance running.
> *Case 3*: If the file is locked, and the pid matches the locker process, then 
> assume it is running as that pid. Note that it's still possible that in 
> between matching the pid and taking an action (e.g. kill), the pid may become 
> stale, but the recycling pattern of pids makes it unlikely to be re-used 
> unless there is a large delay.
> It seems like some tools already do this signal wrapping (note the comment 
> about fcntl and note the race from (3) in the BUGS section):
> http://manpages.ubuntu.com/manpages/natty/man8/ovs-kill.8.html
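The lock, truncate, write sequence from the quoted description could look
roughly like the following. This is only a sketch in Python, not the Mesos
implementation: the function name acquire_pidfile and the 0644 file mode are
assumptions made for illustration, and the signal-handler unlink is left out.

```python
import fcntl
import os


def acquire_pidfile(path):
    """Lock `path` exclusively and record our pid in it (hypothetical helper).

    Returns the open file descriptor. It must stay open for the lifetime of
    the process, because closing it releases the lock.
    Raises BlockingIOError if the lock is already held elsewhere.
    """
    # Assumed mode 0o644; the original description does not specify one.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes a contended lock fail immediately instead of blocking.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise
    # Truncate only after the lock is held, so a failed start attempt never
    # wipes the pid written by the instance that is actually running.
    os.ftruncate(fd, 0)
    os.write(fd, f"{os.getpid()}\n".encode())
    return fd
```

A second call against the same path fails while the first descriptor is
still open, which is the "prevent multiple binaries from starting" behaviour
the description asks for.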

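The tooling-side check from Cases 1 to 3 might be approximated as below. This
is a reduced sketch under stated assumptions: pidfile_status is a hypothetical
name, and because flock() does not report which process holds a lock, the
sketch can only tell "unlocked" from "locked" and return the recorded pid. It
cannot confirm that the recorded pid is the locker, which Case 3 requires;
POSIX record locks queried with fcntl F_GETLK do report the holder's pid.

```python
import fcntl
import os


def pidfile_status(path):
    """Classify a pidfile (hypothetical helper).

    Returns ("not_running", None) when the file is missing or unlocked
    (Case 1), otherwise ("locked", pid), where pid is None if the file does
    not yet contain a number (treat that like Case 2).
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except FileNotFoundError:
        return ("not_running", None)
    try:
        try:
            # A shared, non-blocking probe fails only if an exclusive lock
            # is currently held on another open file description.
            fcntl.flock(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
        except BlockingIOError:
            text = os.read(fd, 64).decode(errors="replace").strip()
        else:
            # Probe succeeded, so nobody held the lock; release it at once.
            fcntl.flock(fd, fcntl.LOCK_UN)
            return ("not_running", None)
    finally:
        os.close(fd)
    return ("locked", int(text) if text.isdigit() else None)
```

Note that the probe briefly holds a shared lock itself, so an instance
starting at that exact moment could fail to get its exclusive lock; that is
consistent with the "eventually converge" reasoning in Case 1 but worth
keeping in mind.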


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
