Hello, I don't really understand what the problem is.

You said that one of your Ray processes is receiving a SIGHUP signal even
though you are running the whole thing with nohup, right? One explanation
could be that another user is sending SIGHUP to your Ray processes with the
kill program. Can you try running your job inside screen or tmux?
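For example, something like this (just a sketch: the session name "ray-job"
and the Ray.log redirection are placeholders of mine, and the mpiexec command
line is the one you posted below):

$ tmux new -s ray-job                         # start a named tmux session
$ mpiexec -n 16 Ray Col.conf > Ray.log 2>&1   # run Ray inside the session
  (detach with Ctrl-b d; the job keeps running after you log out)
$ tmux attach -t ray-job                      # re-attach later to check on it

The same idea with screen:

$ screen -S ray-job
$ mpiexec -n 16 Ray Col.conf > Ray.log 2>&1
  (detach with Ctrl-a d; re-attach later with "screen -r ray-job")

Because the session survives your terminal, closing the terminal or losing
the connection should not send a SIGHUP to mpiexec or to the Ray ranks.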
On 13/06/13 01:27 PM, Lin wrote:
> > On 12/06/13 05:10 PM, Lin wrote:
> > > Hi,
> > >
> > > Yes, they are. When I run "top" or "ps", there are exactly 16 Ray ranks
> > > and one mpiexec process on the oak machine.
> >
> > But this is before one of the Ray ranks receives a SIGHUP (1, Hangup),
> > right?
>
> Yes. Sometimes the SIGHUP leads to more than one process being killed.
>
> This is the newest error message I got:
> """
> mpiexec noticed that process rank 8 with PID 18757 on node oak exited on
> signal 1 (Hangup).
> --------------------------------------------------------------------------
> 3 total processes killed (some possibly by mpiexec during cleanup)
> """
>
> > > But this problem does not always happen because I have gotten some good
> > > results from Ray when I ran it for other datasets.
> >
> > I never got this SIGHUP with Ray. That's strange.
> >
> > Is it reproducible, meaning that if you run the same thing 10 times, do
> > you get this SIGHUP 10 times too?
>
> I cannot say it is totally reproducible, but if I run the same thing 10
> times, I would guess that 9 of them will fail.
>
> Yes, it is really strange because I did not get any error when I ran it the
> first several times.
>
> On Thu, Jun 13, 2013 at 7:59 AM, Sébastien Boisvert
> <sebastien.boisvert.3@ulaval.ca> wrote:
> > On 12/06/13 05:10 PM, Lin wrote:
> > > Hi,
> > >
> > > Yes, they are. When I run "top" or "ps", there are exactly 16 Ray ranks
> > > and one mpiexec process on the oak machine.
> >
> > But this is before one of the Ray ranks receives a SIGHUP (1, Hangup),
> > right?
> >
> > > But this problem does not always happen because I have gotten some good
> > > results from Ray when I ran it for other datasets.
> >
> > I never got this SIGHUP with Ray. That's strange.
> >
> > Is it reproducible, meaning that if you run the same thing 10 times, do
> > you get this SIGHUP 10 times too?
> >
> > > Thanks
> > > Lin
> > >
> > > On Wed, Jun 12, 2013 at 8:00 AM, Sébastien Boisvert
> > > <sebastien.boisvert.3@ulaval.ca> wrote:
> > > > On 10/06/13 05:26 PM, Lin wrote:
> > > > > Hi,
> > > > >
> > > > > Thanks for your answers.
> > > > > However, I got the error message from nohup.out. That is to say, I
> > > > > have used nohup to run Ray.
> > > > >
> > > > > This is my command:
> > > > > nohup mpiexec -n 16 Ray Col.conf &
> > > >
> > > > Are all your MPI ranks running on the "oak" machine?
> > > >
> > > > > And the Col.conf contains:
> > > > >
> > > > > -k 55 # this is a comment
> > > > > -p
> > > > > /s/oak/a/nobackup/lin/Art/Col_illumina_art/Col_il1.fastq
> > > > > /s/oak/a/nobackup/lin/Art/Col_illumina_art/Col_il2.fastq
> > > > > -o RayOutputOfCol
> > > > >
> > > > > On Mon, Jun 10, 2013 at 2:02 PM, Sébastien Boisvert
> > > > > <sebastien.boisvert.3@ulaval.ca> wrote:
> > > > > > On 09/06/13 11:35 AM, Lin wrote:
> > > > > > > Hi, Sébastien
> > > > > > >
> > > > > > > I changed the Max Kmer to 64 and set it to 55 in a run.
> > > > > > > But it always ends up with a problem like this:
> > > > > > > "mpiexec noticed that process rank 11 with PID 25012 on node
> > > > > > > oak exited on signal 1 (Hangup)"
> > > > > > > Could you help me figure it out?
> > > > > > The signal 1 is SIGHUP according to this list:
> > > > > >
> > > > > > $ kill -l
> > > > > >  1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL       5) SIGTRAP
> > > > > >  6) SIGABRT      7) SIGBUS       8) SIGFPE       9) SIGKILL     10) SIGUSR1
> > > > > > 11) SIGSEGV     12) SIGUSR2     13) SIGPIPE     14) SIGALRM     15) SIGTERM
> > > > > > 16) SIGSTKFLT   17) SIGCHLD     18) SIGCONT     19) SIGSTOP     20) SIGTSTP
> > > > > > 21) SIGTTIN     22) SIGTTOU     23) SIGURG      24) SIGXCPU     25) SIGXFSZ
> > > > > > 26) SIGVTALRM   27) SIGPROF     28) SIGWINCH    29) SIGIO       30) SIGPWR
> > > > > > 31) SIGSYS      34) SIGRTMIN    35) SIGRTMIN+1  36) SIGRTMIN+2  37) SIGRTMIN+3
> > > > > > 38) SIGRTMIN+4  39) SIGRTMIN+5  40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8
> > > > > > 43) SIGRTMIN+9  44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
> > > > > > 48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
> > > > > > 53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9  56) SIGRTMAX-8  57) SIGRTMAX-7
> > > > > > 58) SIGRTMAX-6  59) SIGRTMAX-5  60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2
> > > > > > 63) SIGRTMAX-1  64) SIGRTMAX
> > > > > >
> > > > > > This signal is not related to the compilation option
> > > > > > MAXKMERLENGTH=64.
> > > > > >
> > > > > > You are getting this signal because the parent process of your
> > > > > > mpiexec process dies (probably because you are closing your
> > > > > > terminal), and this causes the SIGHUP that is sent to your Ray
> > > > > > processes.
> > > > > >
> > > > > > There are several solutions to this issue (pick one solution in
> > > > > > the list below):
> > > > > >
> > > > > > 1. Use nohup (i.e.: nohup mpiexec -n 999 Ray -p data1.fastq.gz
> > > > > >    data2.fastq.gz)
> > > > > >
> > > > > > 2. Launch your work inside a screen session (the screen command)
> > > > > >
> > > > > > 3. Launch your work inside a tmux session (the tmux command)
> > > > > >
> > > > > > 4. Use a job scheduler (like Moab, Grid Engine, or another).
> > > > > >
> > > > > > --SÉB--
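P.S. If you want to check whether the hangup really comes from another
user's kill rather than from your terminal going away, one rough idea (the
PID 18757 is just the one from your error message; you would use the PID of
a live Ray rank on oak) is to attach strace to a rank and wait for the
signal to show up:

$ ps aux | grep '[R]ay'                      # list the Ray ranks and their PIDs
$ strace -p 18757 -o /tmp/ray-signals.txt    # trace that rank until it exits

When the rank receives the hangup, strace prints a line like
"--- SIGHUP (Hangup) ---" in the output file, and newer strace versions also
print the siginfo (including the PID and UID of the sender), which would
tell you whether it is mpiexec, the terminal, or someone else's kill.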
