On 21/06/13 08:32 PM, Lin wrote:
> Hi,
>
> Can you answer these 3 questions?
>
> 1. Which Ray version are you using?
>
> Ray version 2.2.0
> License for Ray: GNU General Public License version 3
> RayPlatform version: 1.1.1
> License for RayPlatform: GNU Lesser General Public License version 3
>
> 2. Which MPI library are you using (Open-MPI, MPICH, MVAPICH, Intel MPI, or another)?
>
> Open MPI 1.5.4
>
> 3. What is the version of your MPI library?
>
> MCA v2.0, API v2.0, Component v1.5.4
> Open RTE: 1.5.4
> OPAL: 1.5.4
>
> It seems like the newest version of Open MPI is 1.7.1.
> Does that matter?

I think the 1.5 series was a feature series (it is now retired). The 1.6 series
is the most stable right now:

http://www.open-mpi.org/software/ompi/v1.6/
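For reference, a quick way to confirm which Open MPI build the mpiexec on your
PATH belongs to (this assumes Open MPI is the installed implementation; ompi_info
ships with Open MPI):

  which mpiexec        # path of the launcher actually being used
  mpiexec --version    # Open MPI's launcher reports its own version
  ompi_info | head     # lists the Open MPI, Open RTE and OPAL versions

If several MPI installations are present on the machine, this also catches the
case where Ray was compiled against one library but launched with another's
mpiexec.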
> Thanks
>
> On Fri, Jun 21, 2013 at 7:40 AM, Sébastien Boisvert
> <[email protected]> wrote:
>
> [Please use the mailing list]
>
> On 20/06/13 11:39 PM, Lin wrote:
>
> I reduced the reads and ran it without nohup.
> Now it always ends up with the following problem:
>
> "
> mpiexec has exited due to process rank 6 with PID 24558 on
> node oak exiting improperly. There are two reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpiexec (as reported here).
> "
>
> Can you answer these 3 questions?
>
> 1. Which Ray version are you using?
>
> 2. Which MPI library are you using (Open-MPI, MPICH, MVAPICH, Intel MPI,
> or another)?
>
> 3. What is the version of your MPI library?
>
> On Wed, Jun 19, 2013 at 3:05 PM, Sébastien Boisvert
> <[email protected]> wrote:
>
> On 19/06/13 05:00 PM, Lin wrote:
>
> Hi, Sébastien,
>
> I tried your suggestion and ran my job inside screen without nohup.
> It did not stop with signal SIGHUP.
> However, Ray almost ran out of memory, and it has been running for over
> three days since I started the job.
> It keeps repeating information like this:
>
> "
> Rank 1 is counting k-mers in sequence reads [11200001/22166944]
> Speed RAY_SLAVE_MODE_ADD_VERTICES 0 units/second
> Estimated remaining time for this step: -8 seconds
> Rank 10 has 621700000 vertices
> Rank 10: assembler memory usage: 36323284 KiB
> Rank 13 has 621600000 vertices
> Rank 13: assembler memory usage: 36323288 KiB
> Rank 8 has 621700000 vertices
> Rank 8: assembler memory usage: 36323280 KiB
> Rank 7 has 621700000 vertices
> Rank 7: assembler memory usage: 36323280 KiB
> Rank 3 has 621800000 vertices
> Rank 3: assembler memory usage: 36323284 KiB
> Rank 2 has 621700000 vertices
> Rank 2: assembler memory usage: 36323284 KiB
> Rank 1 has 621700000 vertices
> Rank 1: assembler memory usage: 36319196 KiB
> Rank 12 has 621700000 vertices
> Rank 12: assembler memory usage: 36323280 KiB
> Rank 6 has 621700000 vertices
> .....
> .....
> Rank 5 is counting k-mers in sequence reads [11000001/22166944]
> Speed RAY_SLAVE_MODE_ADD_VERTICES 0 units/second
> Estimated remaining time for this step: -8 seconds
> "
>
> You can reduce the number of reads or increase the number of machines
> on which you run Ray.
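If more machines are available, one way to act on that suggestion is to give
mpiexec a hostfile so the ranks, and with them the k-mer table, are spread over
several nodes. A minimal sketch, assuming Open MPI and assuming Ray and the
input files are reachable from every node; the hostfile name and the second
hostname ("elm") are placeholders:

  $ cat hosts.txt
  oak slots=16
  elm slots=16

  $ mpiexec -hostfile hosts.txt -n 32 Ray Col.conf

Ray distributes the k-mer graph across ranks, so running more ranks on more
machines should lower the memory needed by each rank for the same dataset.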
> On Fri, Jun 14, 2013 at 3:24 PM, Sébastien Boisvert
> <[email protected]> wrote:
>
> Hello,
>
> I don't really understand what the problem is.
>
> You said that one of your Ray processes is receiving a SIGHUP signal,
> even if you are running the whole thing with nohup, right?
>
> One explanation could be that another user is sending SIGHUP with the
> kill program to your Ray processes.
>
> Can you try running your job inside screen or tmux?
>
> On 13/06/13 01:27 PM, Lin wrote:
>
> On 12/06/13 05:10 PM, Lin wrote:
>
> Hi,
>
> Yes, they are. When I run "top" or "ps", there are exactly 16 Ray ranks
> and one mpiexec process on the oak machine.
>
> But this is before one of the Ray ranks receives a SIGHUP (1, Hangup), right?
>
> Yes. Sometimes the SIGHUP leads to more than one process being killed.
>
> This is the newest error message I got:
> """
> mpiexec noticed that process rank 8 with PID 18757 on node oak exited
> on signal 1 (Hangup).
> --------------------------------------------------------------------------
> 3 total processes killed (some possibly by mpiexec during cleanup)
> """
>
> But this problem does not always happen, because I have gotten some good
> results from Ray when I ran it on other datasets.
>
> I never got this SIGHUP with Ray. That's strange.
>
> Is it reproducible, meaning that if you run the same thing 10 times, do
> you get this SIGHUP 10 times too?
>
> I cannot say it is totally reproducible, but if I run the same thing 10
> times, I guess 9 of them will fail.
>
> Yes, it is really strange, because I did not get any error when I ran it
> the first several times.
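Since the hangups seem tied to the controlling terminal rather than to Ray
itself, the screen suggestion above is worth spelling out. A minimal workflow,
with the session name being only a placeholder (tmux behaves the same way with
tmux new -s and tmux attach -t):

  screen -S ray              # start a named screen session
  mpiexec -n 16 Ray Col.conf # launch Ray inside the session
  # detach with Ctrl-a d; the session and the job keep running after logout

  screen -r ray              # reattach later to check on progress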
> Thanks
> Lin
>
> On Wed, Jun 12, 2013 at 8:00 AM, Sébastien Boisvert
> <[email protected]> wrote:
>
> On 10/06/13 05:26 PM, Lin wrote:
>
> Hi,
>
> Thanks for your answers.
> However, I got the error message from nohup.out. That is to say, I did
> use nohup to run Ray.
>
> This is my command:
>
> nohup mpiexec -n 16 Ray Col.conf &
>
> Are all your MPI ranks running on the "oak" machine?
>
> And Col.conf contains:
>
> -k 55 # this is a comment
> -p /s/oak/a/nobackup/lin/Art/Col_illumina_art/Col_il1.fastq /s/oak/a/nobackup/lin/Art/Col_illumina_art/Col_il2.fastq
> -o RayOutputOfCol
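One quick way to answer that question from another terminal on oak, assuming
the assembler processes show up under the command name Ray as they do in top:

  pgrep -c Ray                           # should print 16 for "mpiexec -n 16"
  ps -C Ray -o pid,ppid,stat,etime,cmd   # the ranks and their parent (mpiexec)
  ps -C mpiexec -o pid,ppid,cmd          # mpiexec's parent is what could send it a SIGHUP

With no hostfile and no scheduler involved, Open MPI normally launches every
rank on the local machine, so all 16 ranks should indeed be on oak.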
> On Mon, Jun 10, 2013 at 2:02 PM, Sébastien Boisvert
> <[email protected]> wrote:
>
> On 09/06/13 11:35 AM, Lin wrote:
>
> Hi, Sébastien,
>
> I changed MAXKMERLENGTH to 64 and set k to 55 in a run.
> But it always ends up with a problem like this:
>
> "mpiexec noticed that process rank 11 with PID 25012 on node oak exited
> on signal 1 (Hangup)"
>
> Could you help me figure it out?
>
> The signal 1 is SIGHUP according to this list:
>
> $ kill -l
>  1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL       5) SIGTRAP
>  6) SIGABRT      7) SIGBUS       8) SIGFPE       9) SIGKILL     10) SIGUSR1
> 11) SIGSEGV     12) SIGUSR2     13) SIGPIPE     14) SIGALRM     15) SIGTERM
> 16) SIGSTKFLT   17) SIGCHLD     18) SIGCONT     19) SIGSTOP     20) SIGTSTP
> 21) SIGTTIN     22) SIGTTOU     23) SIGURG      24) SIGXCPU     25) SIGXFSZ
> 26) SIGVTALRM   27) SIGPROF     28) SIGWINCH    29) SIGIO       30) SIGPWR
> 31) SIGSYS      34) SIGRTMIN    35) SIGRTMIN+1  36) SIGRTMIN+2  37) SIGRTMIN+3
> 38) SIGRTMIN+4  39) SIGRTMIN+5  40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8
> 43) SIGRTMIN+9  44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
> 48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
> 53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9  56) SIGRTMAX-8  57) SIGRTMAX-7
> 58) SIGRTMAX-6  59) SIGRTMAX-5  60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2
> 63) SIGRTMAX-1  64) SIGRTMAX
>
> This signal is not related to the compilation option MAXKMERLENGTH=64.
>
> You are getting this signal because the parent process of your mpiexec
> process dies (probably because you are closing your terminal), and this
> causes the SIGHUP that is being sent to your Ray processes.
>
> There are several solutions to this issue (pick one solution from the
> list below):
>
> 1. Use nohup (i.e.: nohup mpiexec -n 999 Ray -p data1.fastq.gz data2.fastq.gz &)
>
> 2. Launch your work inside a screen session (the screen command)
>
> 3. Launch your work inside a tmux session (the tmux command)
>
> 4. Use a job scheduler (like Moab, Grid Engine, or another); a submission
>    script is sketched below.
>
> --SÉB--
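As an illustration of option 4, a minimal Grid Engine submission script might
look like the sketch below; the job name, parallel environment name, slot count
and time limit are site-specific placeholders:

  #!/bin/bash
  #$ -N ray-assembly       # job name (placeholder)
  #$ -cwd                  # run from the submission directory
  #$ -pe orte 16           # parallel environment and slot count (placeholder, ask your admin)
  #$ -l h_rt=72:00:00      # wall-clock limit (placeholder)

  mpiexec -n $NSLOTS Ray Col.conf

Submitting it (for example with qsub) starts the run with no controlling
terminal, so a SIGHUP from a closed login session cannot reach it, and the
output ends up in the job's .o and .e files instead of nohup.out.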
