On 19/06/13 05:00 PM, Lin wrote:
> Hi Sébastien,
>
> I tried your suggestion and ran my job inside screen, without nohup.
> It did not stop with a SIGHUP signal this time.
> However, Ray almost ran out of memory, and it has been running for more
> than three days since I started the job.
> It keeps repeating information like this:
>
> "
> Rank 1 is counting k-mers in sequence reads [11200001/22166944]
> Speed RAY_SLAVE_MODE_ADD_VERTICES 0 units/second
> Estimated remaining time for this step: -8 seconds
> Rank 10 has 621700000 vertices
> Rank 10: assembler memory usage: 36323284 KiB
> Rank 13 has 621600000 vertices
> Rank 13: assembler memory usage: 36323288 KiB
> Rank 8 has 621700000 vertices
> Rank 8: assembler memory usage: 36323280 KiB
> Rank 7 has 621700000 vertices
> Rank 7: assembler memory usage: 36323280 KiB
> Rank 3 has 621800000 vertices
> Rank 3: assembler memory usage: 36323284 KiB
> Rank 2 has 621700000 vertices
> Rank 2: assembler memory usage: 36323284 KiB
> Rank 1 has 621700000 vertices
> Rank 1: assembler memory usage: 36319196 KiB
> Rank 12 has 621700000 vertices
> Rank 12: assembler memory usage: 36323280 KiB
> Rank 6 has 621700000 vertices
> .....
> .....
> Rank 5 is counting k-mers in sequence reads [11000001/22166944]
> Speed RAY_SLAVE_MODE_ADD_VERTICES 0 units/second
> Estimated remaining time for this step: -8 seconds
> "

You can reduce the number of reads or increase the number of machines on
which you run Ray.
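Concretely, both options might look like the sketch below. It assumes
Open MPI's mpiexec and the seqtk tool for subsampling; the second host
name, the slot counts, and the output file names are placeholders, not
something taken from your setup.

# Option 1: subsample the paired reads before assembling (seqtk is an
# assumption; any paired-aware subsampler works). Reusing the same
# seed (-s100) on both files keeps the read pairs in sync.
seqtk sample -s100 Col_il1.fastq 0.5 > Col_il1.half.fastq
seqtk sample -s100 Col_il2.fastq 0.5 > Col_il2.half.fastq

# Option 2: keep all of the reads, but spread the 16 ranks over two
# machines with an MPI host file, so each node holds only half of the
# distributed graph ("pine" stands in for a second machine).
printf 'oak slots=8\npine slots=8\n' > hosts.txt
mpiexec -n 16 -hostfile hosts.txt Ray Col.conf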
On Fri, Jun 14, 2013 at 3:24 PM, Sébastien Boisvert
<[email protected]> wrote:
> Hello,
>
> I don't really understand what the problem is.
>
> You said that one of your Ray processes is receiving a SIGHUP signal
> even though you are running the whole thing with nohup, right?
>
> One explanation could be that another user is sending SIGHUP to your
> Ray processes with the kill program.
>
> Can you try running your job inside screen or tmux?

On 13/06/13 01:27 PM, Lin wrote:
> Yes. Sometimes the SIGHUP leads to more than one process being killed.
>
> This is the newest error message I got:
> """
> mpiexec noticed that process rank 8 with PID 18757 on node oak exited
> on signal 1 (Hangup).
> --------------------------------------------------------------------------
> 3 total processes killed (some possibly by mpiexec during cleanup)
> """
>
> I cannot say it is totally reproducible, but if I run the same thing
> 10 times, I would guess that 9 of them will fail.
>
> Yes, it is really strange, because I did not get any error the first
> several times I ran it.

On Thu, Jun 13, 2013 at 7:59 AM, Sébastien Boisvert
<[email protected]> wrote:
> But this is before one of the Ray ranks receives a SIGHUP (1, Hangup),
> right?
>
> I never got this SIGHUP with Ray. That's strange.
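Both questions can be tested directly. The sketch below assumes strace
is installed and that you are allowed to trace your own processes; the
log file names are placeholders.

# Re-run the same job ten times and record how each run ends, to
# measure how reproducible the failure really is.
for i in $(seq 1 10); do
    mpiexec -n 16 Ray Col.conf > ray.run$i.log 2>&1
    echo "run $i exited with status $?" >> runs.summary
done

# While a run is active, watch one Ray rank for incoming signals; a
# delivered SIGHUP is printed together with the sender's PID (si_pid),
# which shows whether it came from kill, the terminal, or mpiexec.
strace -p "$(pgrep -u "$USER" Ray | head -n 1)" -e trace=signal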
> Is it reproducible, meaning that if you run the same thing 10 times,
> do you get this SIGHUP 10 times too?

On 12/06/13 05:10 PM, Lin wrote:
> Hi,
>
> Yes, they are. When I run "top" or "ps", there are exactly 16 Ray
> ranks and one mpiexec process on the oak machine.
>
> But this problem does not always happen, because I have gotten some
> good results from Ray when I ran it on other datasets.
>
> Thanks,
> Lin

On Wed, Jun 12, 2013 at 8:00 AM, Sébastien Boisvert
<[email protected]> wrote:
> Are all your MPI ranks running on the "oak" machine?

On 10/06/13 05:26 PM, Lin wrote:
> Hi,
>
> Thanks for your answers.
> However, I got the error message from nohup.out; that is to say, I had
> already used nohup to run Ray.
>
> This is my command:
>
> nohup mpiexec -n 16 Ray Col.conf &
>
> And Col.conf contains:
>
> -k 55 # this is a comment
> -p /s/oak/a/nobackup/lin/Art/Col_illumina_art/Col_il1.fastq
>    /s/oak/a/nobackup/lin/Art/Col_illumina_art/Col_il2.fastq
> -o RayOutputOfCol

On Mon, Jun 10, 2013 at 2:02 PM, Sébastien Boisvert
<[email protected]> wrote:
> On 09/06/13 11:35 AM, Lin wrote:
>> Hi Sébastien,
>>
>> I changed the maximum k-mer length (MAXKMERLENGTH) to 64 and used
>> -k 55 in a run, but it always ends with a problem like this:
>>
>> "mpiexec noticed that process rank 11 with PID 25012 on node oak
>> exited on signal 1 (Hangup)"
>>
>> Could you help me figure it out?
>
> The signal 1 is SIGHUP according to this list:
>
> $ kill -l
>  1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL       5) SIGTRAP
>  6) SIGABRT      7) SIGBUS       8) SIGFPE       9) SIGKILL     10) SIGUSR1
> 11) SIGSEGV     12) SIGUSR2     13) SIGPIPE     14) SIGALRM     15) SIGTERM
> 16) SIGSTKFLT   17) SIGCHLD     18) SIGCONT     19) SIGSTOP     20) SIGTSTP
> 21) SIGTTIN     22) SIGTTOU     23) SIGURG      24) SIGXCPU     25) SIGXFSZ
> 26) SIGVTALRM   27) SIGPROF     28) SIGWINCH    29) SIGIO       30) SIGPWR
> 31) SIGSYS      34) SIGRTMIN    35) SIGRTMIN+1  36) SIGRTMIN+2  37) SIGRTMIN+3
> 38) SIGRTMIN+4  39) SIGRTMIN+5  40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8
> 43) SIGRTMIN+9  44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
> 48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
> 53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9  56) SIGRTMAX-8  57) SIGRTMAX-7
> 58) SIGRTMAX-6  59) SIGRTMAX-5  60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2
> 63) SIGRTMAX-1  64) SIGRTMAX
>
> This signal is not related to the compilation option MAXKMERLENGTH=64.
>
> You are getting this signal because the parent process of your mpiexec
> process dies (probably because you are closing your terminal), which
> causes the SIGHUP to be sent to your Ray processes.
>
> There are several solutions to this issue (pick one solution from the
> list below; a sketch of options 2, 3, and 4 follows the list):
>
> 1. Use nohup (i.e.: nohup mpiexec -n 999 Ray -p data1.fastq.gz
>    data2.fastq.gz &).
>
> 2. Launch your work inside a screen session (the screen command).
>
> 3. Launch your work inside a tmux session (the tmux command).
>
> 4. Use a job scheduler (like Moab, Grid Engine, or another).
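For instance, options 2, 3, and 4 might look like this. The session
names are arbitrary, and the Grid Engine parallel environment called
"mpi" is an assumption; your cluster may name it differently.

# Option 2: run inside screen; detach with Ctrl-a d and reattach
# later with "screen -r ray".
screen -S ray
mpiexec -n 16 Ray Col.conf 2>&1 | tee ray.log

# Option 3: the same with tmux; detach with Ctrl-b d and reattach
# with "tmux attach -t ray".
tmux new -s ray
mpiexec -n 16 Ray Col.conf 2>&1 | tee ray.log

# Option 4: a minimal Grid Engine submission script (ray.sh),
# submitted with "qsub ray.sh".
#   #!/bin/sh
#   #$ -N ray-assembly
#   #$ -pe mpi 16
#   #$ -cwd
#   mpiexec -n 16 Ray Col.conf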
> --SÉB--

_______________________________________________
Denovoassembler-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
