Instead there seems to be another explanation.
If we look at python's parent processes and their open files, we can guess which process opened /dev/ipmi0.
$ pstree -p 44409
slurmstepd(44409)─┬─bash(44540)───python(8711)
                  ├─{slurmstepd}(44462)
                  ├─{slurmstepd}(44463)
                  ├─{slurmstepd}(44464)
                  ├─{slurmstepd}(44527)
                  └─{slurmstepd}(44528)

$ ls -lah /proc/8711/fd
<skip>
lrwx------ 1 s9951545 p_ffmk 64 Mar 2 10:47 5 -> /dev/ipmi0
lrwx------ 1 s9951545 p_ffmk 64 Mar 2 10:47 7 -> /dev/ipmi0

$ ls -lah /proc/44540/fd
<skip>
lrwx------ 1 s9951545 p_ffmk 64 Mar 2 10:40 5 -> /dev/ipmi0
lrwx------ 1 s9951545 p_ffmk 64 Mar 2 10:40 7 -> /dev/ipmi0

$ ls -lah /proc/44409/fd
ls: cannot open directory /proc/44409/fd: Permission denied

$ ls -lah /proc/44462/fd
ls: cannot open directory /proc/44462/fd: Permission denied

Unfortunately, I can't see which files the slurmstepd processes have open, but it would make sense for these processes to open /dev/ipmi0.
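The manual inspection above can also be scripted. The following is only a rough sketch, assuming a Linux /proc filesystem; the helper names parent_pid, fds_pointing_to and report are made up for illustration and are not part of DMTCP or Slurm.

```python
import os


def parent_pid(pid: int) -> int:
    """Return the parent PID recorded in /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as stat:
        data = stat.read()
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or ')', so split after the last ')'.
    fields = data[data.rfind(")") + 2:].split()
    return int(fields[1])  # fields[0] is the state, fields[1] the PPID


def fds_pointing_to(pid: int, target: str):
    """Return the descriptor numbers of `pid` that resolve to `target`,
    or None if /proc/<pid>/fd is not readable by the current user."""
    fd_dir = f"/proc/{pid}/fd"
    try:
        names = os.listdir(fd_dir)
    except PermissionError:
        return None
    matches = []
    for name in names:
        try:
            if os.readlink(os.path.join(fd_dir, name)) == target:
                matches.append(int(name))
        except OSError:
            continue  # descriptor was closed while we were looking
    return sorted(matches)


def report(pid: int, target: str = "/dev/ipmi0") -> None:
    """Print matching descriptors for `pid` and each of its ancestors."""
    while pid > 0:
        fds = fds_pointing_to(pid, target)
        print(pid, "permission denied" if fds is None else fds)
        pid = parent_pid(pid)
```

Calling report(8711) on the node in question would walk from python up through bash and slurmstepd, printing "permission denied" wherever the fd directory belongs to another user.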
Thus the whole picture looks as follows: slurmstepd opens /dev/ipmi0, then it starts a bash process, which inherits the open files from its parent. Then I start python in bash, and it in turn inherits the open file descriptors for /dev/ipmi0.
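The inheritance described here can be reproduced in a few lines. This is a minimal sketch, not a reconstruction of what slurmstepd does: it uses a temporary file in place of /dev/ipmi0, and the names CHILD and inode_seen_by_child are hypothetical. One caveat: since Python 3.4, Python marks the descriptors it opens as non-inheritable by default, so the sketch has to opt in explicitly, whereas a C program gets inheritance across exec unless it sets the close-on-exec flag.

```python
import os
import subprocess
import sys
import tempfile

# Child program: print the inode behind the descriptor number given in argv[1].
CHILD = "import os, sys; print(os.fstat(int(sys.argv[1])).st_ino)"


def inode_seen_by_child(fd: int) -> int:
    """Spawn a child that inherits `fd` and return the inode it sees there."""
    # Opt in to inheritance; Python would otherwise close fd at exec time.
    os.set_inheritable(fd, True)
    result = subprocess.run(
        [sys.executable, "-c", CHILD, str(fd)],
        close_fds=False,          # leave inheritable descriptors open
        stdout=subprocess.PIPE,
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        same = inode_seen_by_child(f.fileno()) == os.fstat(f.fileno()).st_ino
        print("child sees the parent's file:", same)
```

The child never opens the file itself, yet the same descriptor number refers to the same inode, which is how a process can hold a device open that it would not be allowed to open on its own.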
It seems that on some nodes of our system access to this device was restricted for ordinary users, and this caused the error. Then either I switched nodes or the admins updated the system-wide policy during the last month, and the error vanished.
I think the fact that slurmstepd does not close these files before calling exec is a bug.
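For reference, the usual way to keep a descriptor from leaking into exec'ed programs is the close-on-exec flag (FD_CLOEXEC). The sketch below only illustrates that flag on a temporary file; whether and where slurmstepd should set it is not something this thread establishes, and the names PROBE and survives_exec are made up.

```python
import fcntl
import subprocess
import sys
import tempfile

# Child program: report whether the descriptor number in argv[1] is open.
PROBE = (
    "import os, sys\n"
    "try:\n"
    "    os.fstat(int(sys.argv[1]))\n"
    "    print('open')\n"
    "except OSError:\n"
    "    print('closed')\n"
)


def survives_exec(fd: int, close_on_exec: bool) -> bool:
    """Set or clear FD_CLOEXEC on `fd`, exec a child, and report whether
    the child still has the descriptor open."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    if close_on_exec:
        flags |= fcntl.FD_CLOEXEC
    else:
        flags &= ~fcntl.FD_CLOEXEC
    fcntl.fcntl(fd, fcntl.F_SETFD, flags)
    result = subprocess.run(
        [sys.executable, "-c", PROBE, str(fd)],
        close_fds=False,          # rely on FD_CLOEXEC alone
        stdout=subprocess.PIPE,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "open"


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        print("without FD_CLOEXEC:", survives_exec(f.fileno(), False))
        print("with FD_CLOEXEC:   ", survives_exec(f.fileno(), True))
```

With the flag set, the kernel closes the descriptor during exec, so the child (bash and python in the scenario above) would never see it.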
On 03/01/2016 06:39 PM, Rohan Garg wrote:
Interesting! So your python does talk to the ipmi0 device. I was confused about why DMTCP would try to open a connection to /dev/ipmi0 on restart.

One thing still puzzles me though: if /dev/ipmi0 didn't have the right set of permissions earlier, how was python able to open the device at launch time? (Perhaps the python process started as root and dropped privileges?)

Are you using any IPMI library/module for python? Or it could be that one of the libraries/modules you are importing, or the python interpreter itself, is linked against some IPMI library. That could explain why your python interpreter opens the device. One way to verify this is to look at the output of the "ldd" command on the libraries/interpreter executable. Example:

# The following command lists the libraries the python interpreter
# is linked against.
$ ldd `which python`
    linux-vdso.so.1 (0x00007ffff01a9000)
    libpython2.7.so.1.0 => /usr/lib64/libpython2.7.so.1.0 (0x00007f27f1dea000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f27f1bcd000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f27f1827000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f27f1623000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00007f27f1420000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f27f1121000)
    /lib64/ld-linux-x86-64.so.2 (0x000055a49e096000)

Another thing that you can try is to run your python interpreter under "strace -f" to trace the system calls it makes, and to verify whether it or some child process opens the IPMI device.

----- Original Message -----
From: "Maksym Planeta" <mplan...@os.inf.tu-dresden.de>
To: "Rohan Garg" <rohg...@ccs.neu.edu>
Cc: "dmtcp-forum" <Dmtcp-forum@lists.sourceforge.net>
Sent: Tuesday, March 1, 2016 3:31:02 AM
Subject: Re: [Dmtcp-forum] Restart does not work: /dev/ipmi0 Permission denied

Hi Rohan,

thank you for the reply.

Somehow the issue was solved without my intervention. When I retry the same thing now, the error with python vanishes, i.e. I'm able to restart the python shell.
And the reason for that is that the access rights to /dev/ipmi0 have changed:

$ ls -lah /dev/ipmi0
crw-rw-rw- 1 root root 245, 0 Nov 5 10:48 /dev/ipmi0

But for the sake of completeness I will answer your questions, in case they are still useful.

1. I never run python as the root user on this machine, simply because I have no root access.
2. The dmtcp version is 2.4.4.
3. I think this is some Bull Linux, but the system release reports the following:

$ cat /etc/system-release
Red Hat Enterprise Linux Server release 6.4 (Santiago)

Open file descriptors before checkpoint:

$ ls -l /proc/$(ps x | grep -e python | grep -v grep | awk '{print $1}')/fd
total 0
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 0 -> /dev/pts/8
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 1 -> /dev/pts/8
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 2 -> /dev/pts/8
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 5 -> /dev/ipmi0
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 7 -> /dev/ipmi0
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 821 -> socket:[34528195]
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 827 -> /dev/pts/8
l-wx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 828 -> /tmp/dmtcp-s9951545@taurusi5591/jassertlog.4b3242428f3a397f-40000-56d5522b_python
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 831 -> /tmp/dmtcp-s9951545@taurusi5591/dmtcpSharedArea.4b3242428f3a397f-40000-56d5522b.56d5522b9

After checkpoint:

$ ls -l /proc/$(ps x | grep -e python | grep -v grep | awk '{print $1}')/fd
total 0
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 0 -> /dev/pts/8
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 1 -> /dev/pts/8
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 2 -> /dev/pts/8
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 5 -> /dev/ipmi0
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 7 -> /dev/ipmi0
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 821 -> socket:[34528195]
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 827 -> /dev/pts/8
l-wx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 828 -> /tmp/dmtcp-s9951545@taurusi5591/jassertlog.4b3242428f3a397f-40000-56d5522b_python
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 831 -> /tmp/dmtcp-s9951545@taurusi5591/dmtcpSharedArea.4b3242428f3a397f-40000-56d5522b.56d5522b9

And after restart:

$ ls -l /proc/$(ps x | grep -e python | grep -v grep | awk '{print $1}')/fd
total 0
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 0 -> /dev/pts/6
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 1 -> /dev/pts/6
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 2 -> /dev/pts/6
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 5 -> /dev/ipmi0
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 7 -> /dev/ipmi0
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 821 -> socket:[34528397]
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 827 -> /dev/pts/6
lrwx------ 1 s9951545 p_ffmk 64 Mar 1 09:26 831 -> /tmp/dmtcp-s9951545@taurusi5591/dmtcpSharedArea.4b3242428f3a397f-40000-56d5522b.56d552486

On 02/29/2016 06:13 PM, Rohan Garg wrote:

Hi Maksym,

This looks like a strange issue. I have some questions about your setup.

- Do you launch your python interpreter with sudo privileges or as the root user?
- What python version are you using? What DMTCP version are you using?
- What distro are you using?

At restart time, DMTCP tries to restore file connections that the process had opened at checkpoint time. I'm not sure why it's trying to open '/dev/ipmi0' on restart. Can you share the output of the following command:

    ls -l /proc/<PID>/fd

prior to checkpointing? (Here PID is the process id of the python interpreter that you launch under DMTCP.) This will help us identify whether, for some strange reason, the python interpreter opens /dev/ipmi0 on your setup.

Thanks,
Rohan

On Feb 1, 2016, at 3:39 AM, Maksym Planeta <mplan...@os.inf.tu-dresden.de> wrote:

Hello,

I'm trying to set up DMTCP. I installed it and launched the coordinator. Then I launched a python interpreter, created a variable, switched to the coordinator, initiated a checkpoint, and killed all coordinator clients with the "k" command.
After this the python interpreter was terminated and several new files appeared in the directory where the coordinator was running.

Next I wanted to restart the interpreter. I still had my coordinator open, so I decided to use dmtcp_restart to launch python again:

dmtcp_restart ckpt_*.dmtcp

But this resulted in the following error report:

[40000] ERROR at fileconnection.cpp:863 in openFile; REASON='JASSERT(fd != -1) failed'
     _path = /dev/ipmi0
     (strerror((*__errno_location ()))) = Permission denied

I have this file:

$ ls /dev/ipmi0 -lah
crw-rw---- 1 root root 245, 0 Nov 4 10:22 /dev/ipmi0

But I don't have root permissions to manipulate access rights on the file.

Could you tell me what I can do about this? And why does DMTCP try to access a file which the interpreter was never allowed to access?

-- 
Regards,
Maksym Planeta

_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
-- Regards, Maksym Planeta