Hello,

I don't really understand what the problem is.

You said that one of your Ray processes is receiving a SIGHUP signal even
though you are running the whole thing with nohup, right?


One explanation could be that another user is sending SIGHUP to your Ray
processes with the kill program.
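
For example, anyone with the right permissions can trigger exactly this
failure by hand (18757 is the PID from your error message; any live rank PID
works the same way):

    $ kill -HUP 18757    (same as: kill -1 18757)

To see where the signal comes from, you could attach strace to one of the
ranks; when the signal arrives, strace prints its siginfo, including the
sender's PID in the si_pid field:

    $ strace -e trace=signal -p 18757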


Can you try running your job inside screen or tmux?
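
For example, a minimal sketch with screen (the session name "ray" is an
arbitrary choice):

    $ screen -S ray
    $ mpiexec -n 16 Ray Col.conf
    (detach with Ctrl-a d, log out safely, then later: screen -r ray)

or the equivalent with tmux:

    $ tmux new -s ray
    $ mpiexec -n 16 Ray Col.conf
    (detach with Ctrl-b d, then later: tmux attach -t ray)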

On 13/06/13 01:27 PM, Lin wrote:
> On 12/06/13 05:10 PM, Lin wrote:
>
>     Hi,
>
>     Yes, they are. When I run "top" or "ps", there are exactly 16 Ray ranks
> and one mpiexec process on the oak machine.
>
>
> But this is before one of the Ray ranks receives a SIGHUP (1, Hangup), right?
>
> Yes. Sometimes the SIGHUP leads to more than one process being killed.
>
> This is the newest error message I got.
> """
> mpiexec noticed that process rank 8 with PID 18757 on node oak exited on 
> signal 1 (Hangup).
> --------------------------------------------------------------------------
> 3 total processes killed (some possibly by mpiexec during cleanup)
> """
>
>
>     But this problem does not always happen because I have gotten some good 
> results from Ray when I ran it for other datasets.
>
>
> I never got this SIGHUP with Ray. That's strange.
>
> Is it reproducible, meaning that if you run the same thing 10 times, do you
> get this SIGHUP 10 times too?
>
> I cannot say it is totally reproducible, but if I run the same thing 10
> times, I guess 9 of them will fail.
>
> Yes, it is really strange because I did not get any error when I ran it the 
> first several times.
>
>
>
>
> On Thu, Jun 13, 2013 at 7:59 AM, Sébastien Boisvert
> <[email protected]> wrote:
>
>     On 12/06/13 05:10 PM, Lin wrote:
>
>         Hi,
>
>         Yes, they are. When I run "top" or "ps", there are exactly 16 Ray
> ranks and one mpiexec process on the oak machine.
>
>
>     But this is before one of the Ray ranks receives a SIGHUP (1, Hangup),
> right?
>
>
>
>         But this problem does not always happen because I have gotten some 
> good results from Ray when I ran it for other datasets.
>
>
>     I never got this SIGHUP with Ray. That's strange.
>
>     Is it reproducible, meaning that if you run the same thing 10 times, do
> you get this SIGHUP 10 times too?
>
>
>         Thanks
>         Lin
>
>
>
>         On Wed, Jun 12, 2013 at 8:00 AM, Sébastien Boisvert
> <[email protected]> wrote:
>
>              On 10/06/13 05:26 PM, Lin wrote:
>
>                  Hi,
>
>                  Thanks for your answers.
>                  However, I got the error message from nohup.out. That is to 
> say, I have used nohup to run Ray.
>
>                  This is my command:
>                  nohup mpiexec -n 16 Ray Col.conf &
>
>
>              Are all your MPI ranks running on the "oak" machine?
>
>
>                  And the Col.conf contains:
>
>                  -k 55  # this is a comment
>                  -p /s/oak/a/nobackup/lin/Art/Col_illumina_art/Col_il1.fastq
>                     /s/oak/a/nobackup/lin/Art/Col_illumina_art/Col_il2.fastq
>
>                  -o RayOutputOfCol
>
>
>
>
>
>                  On Mon, Jun 10, 2013 at 2:02 PM, Sébastien Boisvert
> <[email protected]> wrote:
>
>                       On 09/06/13 11:35 AM, Lin wrote:
>
>                           Hi, Sébastien
>
>                           I changed the maximum k-mer length to 64 and set
> it to 55 in a run. But it always ends up with a problem like this:
>                           "mpiexec noticed that process rank 11 with PID
> 25012 on node oak exited on signal 1(Hangup)"
>                           Could you help me figure it out?
>
>
>                       Signal 1 is SIGHUP, according to this list:
>
>                       $ kill -l
>                        1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL       5) SIGTRAP
>                        6) SIGABRT      7) SIGBUS       8) SIGFPE       9) SIGKILL     10) SIGUSR1
>                       11) SIGSEGV     12) SIGUSR2     13) SIGPIPE     14) SIGALRM     15) SIGTERM
>                       16) SIGSTKFLT   17) SIGCHLD     18) SIGCONT     19) SIGSTOP     20) SIGTSTP
>                       21) SIGTTIN     22) SIGTTOU     23) SIGURG      24) SIGXCPU     25) SIGXFSZ
>                       26) SIGVTALRM   27) SIGPROF     28) SIGWINCH    29) SIGIO       30) SIGPWR
>                       31) SIGSYS      34) SIGRTMIN    35) SIGRTMIN+1  36) SIGRTMIN+2  37) SIGRTMIN+3
>                       38) SIGRTMIN+4  39) SIGRTMIN+5  40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8
>                       43) SIGRTMIN+9  44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
>                       48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
>                       53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9  56) SIGRTMAX-8  57) SIGRTMAX-7
>                       58) SIGRTMAX-6  59) SIGRTMAX-5  60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2
>                       63) SIGRTMAX-1  64) SIGRTMAX
>
>
>                       This signal is not related to the compilation option
> MAXKMERLENGTH=64.
>
>                       You are getting this signal because the parent process
> of your mpiexec process dies (probably because you are closing your
> terminal), which causes a SIGHUP to be sent to your Ray processes.
>
>
>                       There are several solutions to this issue (pick one
> solution from the list below):
>
>
>                       1. Use nohup (i.e.: nohup mpiexec -n 999 Ray -p
> data1.fastq.gz data2.fastq.gz &)
>
>                       2. Launch your work inside a screen session (the screen 
> command)
>
>                       3. Launch your work inside a tmux session (the tmux 
> command)
>
>                       4. Use a job scheduler (like Moab, Grid Engine, or 
> another).
>
>
>                       --SÉB--
>
>

