[Please use the mailing list]

On 20/06/13 11:39 PM, Lin wrote:
> I reduced the reads and ran it without nohup.
> Now it always ends up with the following problem:
> "
> mpiexec has exited due to process rank 6 with PID 24558 on
> node oak exiting improperly. There are two reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpiexec (as reported here).
> "
>

Can you answer these 3 questions ?

1. Which Ray version are you using ?

2. Which MPI library are you using (Open-MPI, MPICH, MVAPICH, Intel MPI, or 
another) ?

3. What is the version of your MPI library ?
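
For what it's worth, the following commands usually answer questions 2 and 3
(assuming the mpiexec that launched the job is the first one in your PATH). For
question 1, Ray prints its version in the first lines of its log output.

    # MPI launcher / library version (works for Open MPI and MPICH-based builds):
    mpiexec --version
    # Open MPI only:
    ompi_info | head -n 5
    # MPICH only:
    mpichversion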

>
> On Wed, Jun 19, 2013 at 3:05 PM, Sébastien Boisvert <[email protected]> wrote:
>
>     On 19/06/13 05:00 PM, Lin wrote:
>
>         Hi, Sébastien,
>
>         I tried your suggestion and ran my job inside screen without nohup.
>         It did not stop with signal SIGHUP.
>         However, Ray almost ran out of memory, and it has been running for
>         over three days since I started the job.
>         It keeps repeating information like this:
>
>
>     You can reduce the number of reads or increase the number of machines on 
> which you run Ray.
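
(As an aside: if more machines are available, the usual way to spread the ranks
out is a host file. The syntax below is Open MPI style and the second host name
"elm" is only a placeholder; MPICH uses "-f hosts.txt" instead of "-hostfile".)

    # hosts.txt -- one machine per line, with the number of ranks it should get:
    #   oak slots=16
    #   elm slots=16
    mpiexec -n 32 -hostfile hosts.txt Ray Col.conf
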
>
>
>
>         "
>         Rank 1 is counting k-mers in sequence reads [11200001/22166944]
>         Speed RAY_SLAVE_MODE_ADD_VERTICES 0 units/second
>         Estimated remaining time for this step: -8 seconds
>         Rank 10 has 621700000 vertices
>         Rank 10: assembler memory usage: 36323284 KiB
>         Rank 13 has 621600000 vertices
>         Rank 13: assembler memory usage: 36323288 KiB
>         Rank 8 has 621700000 vertices
>         Rank 8: assembler memory usage: 36323280 KiB
>         Rank 7 has 621700000 vertices
>         Rank 7: assembler memory usage: 36323280 KiB
>         Rank 3 has 621800000 vertices
>         Rank 3: assembler memory usage: 36323284 KiB
>         Rank 2 has 621700000 vertices
>         Rank 2: assembler memory usage: 36323284 KiB
>         Rank 1 has 621700000 vertices
>         Rank 1: assembler memory usage: 36319196 KiB
>         Rank 12 has 621700000 vertices
>         Rank 12: assembler memory usage: 36323280 KiB
>         Rank 6 has 621700000 vertices
>         .....
>         .....
>         Rank 5 is counting k-mers in sequence reads [11000001/22166944]
>         Speed RAY_SLAVE_MODE_ADD_VERTICES 0 units/second
>         Estimated remaining time for this step: -8 seconds
>         "
>
>
>
>
>
>         On Fri, Jun 14, 2013 at 3:24 PM, Sébastien Boisvert <[email protected]> wrote:
>
>              Hello,
>
>              I don't really understand what the problem is.
>
>              You said that one of your Ray processes is receiving a SIGHUP
>              signal, even though you are running the whole thing with nohup,
>              right ?
>
>
>              One explanation could be that another user is sending SIGHUP 
> with the kill program to your Ray processes.
>
>
>              Can you try running your job inside screen or tmux ?
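
(For reference, a minimal way to do that with screen, reusing the same 16-rank
command as before; tmux is analogous with "tmux new -s ray" and Ctrl-b d to
detach.)

    screen -S ray                  # start a named screen session
    mpiexec -n 16 Ray Col.conf     # run the job inside it
    # detach with Ctrl-a d; the job keeps running in the background
    screen -r ray                  # reattach later to check on it
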
>
>
>              On 13/06/13 01:27 PM, Lin wrote:
>
>                  On 12/06/13 05:10 PM, Lin wrote:
>
>                       Hi,
>
>                  Yes, they are. When I run "top" or "ps", there are exactly
>                  16 Ray ranks and one mpiexec process on the oak machine.
>
>
>                  But this is before one of the Ray ranks receives a SIGHUP 
> (1, Hangup), right ?
>
>                  Yes. Sometimes the SIGHUP leads to more than one process
>                  being killed.
>
>                  This is the newest error message I got.
>                  """
>                  mpiexec noticed that process rank 8 with PID 18757 on node 
> oak exited on signal 1 (Hangup).
>                  
> --------------------------------------------------------------------------
>
>                  3 total processes killed (some possibly by mpiexec during 
> cleanup)
>                  """
>
>
>                       But this problem does not always happen because I have 
> gotten some good results from Ray when I ran it for other datasets.
>
>
>                  I never got this SIGHUP with Ray. That's strange.
>
>                  Is it reproducible, meaning that if you run the same thing 
> 10 times, do you get this SIGHUP 10 times too ?
>
>                  I cannot say it is totally reproducible, but if I run the
>                  same thing 10 times, I guess 9 of them will fail.
>
>                  Yes, it is really strange because I did not get any error 
> when I ran it the first several times.
>
>
>
>
>                  On Thu, Jun 13, 2013 at 7:59 AM, Sébastien Boisvert <sebastien.boisvert.3@ulaval.ca> wrote:
>
>                       On 12/06/13 05:10 PM, Lin wrote:
>
>                           Hi,
>
>                           Yes, they are. When I run "top" or "ps", there are
>                           exactly 16 Ray ranks and one mpiexec process on the
>                           oak machine.
>
>
>                       But this is before one of the Ray ranks receives a 
> SIGHUP (1, Hangup), right ?
>
>
>
>                           But this problem does not always happen because I 
> have gotten some good results from Ray when I ran it for other datasets.
>
>
>                       I never got this SIGHUP with Ray. That's strange.
>
>                       Is it reproducible, meaning that if you run the same 
> thing 10 times, do you get this SIGHUP 10 times too ?
>
>
>                           Thanks
>                           Lin
>
>
>
>                           On Wed, Jun 12, 2013 at 8:00 AM, Sébastien Boisvert <sebastien.boisvert.3@ulaval.ca> wrote:
>
>                                On 10/06/13 05:26 PM, Lin wrote:
>
>                                    Hi,
>
>                                    Thanks for your answers.
>                                    However, I got the error message from
>                                    nohup.out. That is to say, I was already
>                                    using nohup to run Ray.
>
>                                    This is my command:
>                                    nohup mpiexec -n 16 Ray Col.conf &
>
>
>                                Are all your MPI ranks running on the "oak" 
> machine ?
>
>
>                                    And the Col.conf contains:
>
>                                    -k 55  # this is a comment
>                                    -p /s/oak/a/nobackup/lin/Art/Col_illumina_art/Col_il1.fastq
>                                       /s/oak/a/nobackup/lin/Art/Col_illumina_art/Col_il2.fastq
>
>                                    -o RayOutputOfCol
>
>
>
>
>
>
>
>                                    On Mon, Jun 10, 2013 at 2:02 PM, Sébastien Boisvert <sebastien.boisvert.3@ulaval.ca> wrote:
>
>                                         On 09/06/13 11:35 AM, Lin wrote:
>
>                                             Hi, Sébastien
>
>                                             I changed the maximum k-mer length
>                                             to 64 and set k to 55 in a run.
>                                             But it always ends up with a
>                                             problem like this:
>                                             "mpiexec noticed that process
>                                             rank 11 with PID 25012 on node oak
>                                             exited on signal 1 (Hangup)"
>                                             Could you help me figure it out?
>
>
>                                         Signal 1 is SIGHUP according to this list:
>
>                                         $ kill -l
>                                          1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL       5) SIGTRAP
>                                          6) SIGABRT      7) SIGBUS       8) SIGFPE       9) SIGKILL     10) SIGUSR1
>                                         11) SIGSEGV     12) SIGUSR2     13) SIGPIPE     14) SIGALRM     15) SIGTERM
>                                         16) SIGSTKFLT   17) SIGCHLD     18) SIGCONT     19) SIGSTOP     20) SIGTSTP
>                                         21) SIGTTIN     22) SIGTTOU     23) SIGURG      24) SIGXCPU     25) SIGXFSZ
>                                         26) SIGVTALRM   27) SIGPROF     28) SIGWINCH    29) SIGIO       30) SIGPWR
>                                         31) SIGSYS      34) SIGRTMIN    35) SIGRTMIN+1  36) SIGRTMIN+2  37) SIGRTMIN+3
>                                         38) SIGRTMIN+4  39) SIGRTMIN+5  40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8
>                                         43) SIGRTMIN+9  44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
>                                         48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
>                                         53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9  56) SIGRTMAX-8  57) SIGRTMAX-7
>                                         58) SIGRTMAX-6  59) SIGRTMAX-5  60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2
>                                         63) SIGRTMAX-1  64) SIGRTMAX
>
>
>                                         This signal is not related to the 
> compilation option MAXKMERLENGTH=64.
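
(For context: MAXKMERLENGTH is a compile-time option, so changing it means
rebuilding Ray. A typical rebuild from the Ray source directory looks roughly
like the line below; the exact make variables can differ between Ray releases.)

    make MAXKMERLENGTH=64
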
>
>                                         You are getting this signal because
>                                         the parent process of your mpiexec
>                                         process dies (probably because you are
>                                         closing your terminal), and this causes
>                                         the SIGHUP that is sent to your Ray
>                                         processes.
>
>
>                                         There are several solutions to this
>                                         issue (pick one solution from the list
>                                         below):
>
>
>                                         1. Use nohup (i.e.: nohup mpiexec -n
>                                         999 Ray -p data1.fastq.gz data2.fastq.gz &)
>
>                                         2. Launch your work inside a screen 
> session (the screen command)
>
>                                         3. Launch your work inside a tmux 
> session (the tmux command)
>
>                                         4. Use a job scheduler (like Moab, 
> Grid Engine, or another).
>
>
>                                         --SÉB--
>
>
>


_______________________________________________
Denovoassembler-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
