Jeff,

On Mon, Jun 2, 2014 at 7:26 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
wrote:

> On Jun 2, 2014, at 5:03 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> > i faced a slightly different problem, but it is 100% reproducible:
> > - i launch mpirun (no batch manager) from a node with one IB port
> > - i use -host node01,node02 where node01 and node02 both have two IB
> >   ports on the same subnet
>
> FWIW: 2 IB ports on the same subnet?  That's not a good idea.
>
could you please elaborate a bit?
from what i saw, this basically doubles the bandwidth (IMB PingPong
benchmark) between two nodes (!) which is not a bad thing.
i can only guess this might not scale (e.g. if 16 tasks are running on each
host, the overhead associated with the use of two ports might void the
extra bandwidth).
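
for reference, this is the kind of ping-pong measurement i mean (a minimal
sketch, not IMB itself; the message size and iteration count are arbitrary,
and the comparison is simply running it with and without
--mca btl_openib_if_exclude mlx4_1):

/* pingpong.c -- minimal ping-pong bandwidth sketch (not IMB, just an
 * equivalent loop); run with exactly 2 ranks, one per node */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    const int iters = 1000;
    const int msgsize = 1 << 20;   /* 1 MB messages, arbitrary */
    int rank;
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(msgsize);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (0 == rank) {
            MPI_Send(buf, msgsize, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msgsize, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (1 == rank) {
            MPI_Recv(buf, msgsize, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, msgsize, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (0 == rank) {
        /* 2 messages per iteration, report MB/s */
        printf("bandwidth: %.1f MB/s\n",
               2.0 * iters * msgsize / (t1 - t0) / 1.0e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}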


> > by default, this will hang.
>
> ...but it still shouldn't hang.  I wonder if it's somehow related to
> https://svn.open-mpi.org/trac/ompi/ticket/4442...?
>
i doubt it ...

here is my command line (from node0):
`which mpirun` -np 2 -host node1,node2 --mca rtc_freq_priority 0 --mca btl
openib,self --mca btl_openib_if_include mlx4_0 ./abort
on top of that, the usnic btl is not built (nor installed).
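
the ./abort test is essentially a reproducer along these lines (minimal
sketch, the actual test program may differ in details):

/* abort.c -- minimal sketch of the kind of test i am running:
 * every task calls MPI_Abort() right after MPI_Init() */
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    /* all tasks abort; mpirun is expected to kill the job and return */
    MPI_Abort(MPI_COMM_WORLD, 1);
    /* never reached */
    MPI_Finalize();
    return 0;
}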


> > if this is a "feature" (e.g. openmpi does not support this kind of
> > configuration) i am fine with it.
> >
> > when i run mpirun --mca btl_openib_if_exclude mlx4_1, and the
> > application succeeds, then it works just fine.
> >
> > if the application calls MPI_Abort() /* and even if all tasks call
> > MPI_Abort() */ then it will hang 100% of the time.
> > i do not see that as a feature but as a bug.
>
> Yes, OMPI should never hang upon a call to MPI_Abort.
>
> Can you get some stack traces to show where the hung process(es) are
> stuck?  That would help Ralph pin down where things aren't working down in
> ORTE.
>

on node0:

  \_ -bash
      \_ /.../local/ompi-trunk/bin/mpirun -np 2 -host node1,node2 --mca
rtc_freq_priority 0 --mc
          \_ /usr/bin/ssh -x node1     PATH=/.../local/ompi-trunk/bin:$PATH
; export PATH ; LD_LIBRAR
          \_ /usr/bin/ssh -x node2     PATH=/.../local/ompi-trunk/bin:$PATH
; export PATH ; LD_LIBRAR


pstack (mpirun):
$ pstack 10913
Thread 2 (Thread 0x7f0ecad35700 (LWP 10914)):
#0  0x0000003ba66e15e3 in select () from /lib64/libc.so.6
#1  0x00007f0ecad4391e in listen_thread () from
/.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
#2  0x0000003ba72079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003ba66e8b6d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f0ecc601700 (LWP 10913)):
#0  0x0000003ba66df343 in poll () from /lib64/libc.so.6
#1  0x00007f0ecc6b1a05 in poll_dispatch () from
/.../local/ompi-trunk/lib/libopen-pal.so.0
#2  0x00007f0ecc6a641c in opal_libevent2021_event_base_loop () from
/.../local/ompi-trunk/lib/libopen-pal.so.0
#3  0x00000000004056a1 in orterun ()
#4  0x00000000004039f4 in main ()


on node1:

 sshd: gouaillardet@notty
  \_ bash -c     PATH=/.../local/ompi-trunk/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/...
      \_ /.../local/ompi-trunk/bin/orted -mca ess env -mca orte_ess_jobid
3459448832 -mca orte_ess_vpid
          \_ [abort] <defunct>

pstack (orted):
#0  0x00007fe0ba6a0566 in vfprintf () from /lib64/libc.so.6
#1  0x00007fe0ba6c9a52 in vsnprintf () from /lib64/libc.so.6
#2  0x00007fe0ba6a9523 in snprintf () from /lib64/libc.so.6
#3  0x00007fe0bbc019b6 in orte_util_print_jobids () from
/.../local/ompi-trunk/lib/libopen-rte.so.0
#4  0x00007fe0bbc01791 in orte_util_print_name_args () from
/.../local/ompi-trunk/lib/libopen-rte.so.0
#5  0x00007fe0b8e16a8b in mca_oob_tcp_component_hop_unknown () from
/.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
#6  0x00007fe0bb94ab7a in event_process_active_single_queue () from
/.../local/ompi-trunk/lib/libopen-pal.so.0
#7  0x00007fe0bb94adf2 in event_process_active () from
/.../local/ompi-trunk/lib/libopen-pal.so.0
#8  0x00007fe0bb94b470 in opal_libevent2021_event_base_loop () from
/.../local/ompi-trunk/lib/libopen-pal.so.0
#9  0x00007fe0bbc1fa7b in orte_daemon () from
/.../local/ompi-trunk/lib/libopen-rte.so.0
#10 0x000000000040093a in main ()


on node2:

 sshd: gouaillardet@notty
  \_ bash -c     PATH=/.../local/ompi-trunk/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/...
      \_ /.../local/ompi-trunk/bin/orted -mca ess env -mca orte_ess_jobid
3459448832 -mca orte_ess_vpid
          \_ [abort] <defunct>

pstack (orted):
#0  0x00007fe8fc435e39 in strchrnul () from /lib64/libc.so.6
#1  0x00007fe8fc3ef8f5 in vfprintf () from /lib64/libc.so.6
#2  0x00007fe8fc41aa52 in vsnprintf () from /lib64/libc.so.6
#3  0x00007fe8fc3fa523 in snprintf () from /lib64/libc.so.6
#4  0x00007fe8fd9529b6 in orte_util_print_jobids () from
/.../local/ompi-trunk/lib/libopen-rte.so.0
#5  0x00007fe8fd952791 in orte_util_print_name_args () from
/.../local/ompi-trunk/lib/libopen-rte.so.0
#6  0x00007fe8fab6c1b5 in resend () from
/.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
#7  0x00007fe8fab67ce3 in mca_oob_tcp_component_hop_unknown () from
/.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
#8  0x00007fe8fd69bb7a in event_process_active_single_queue () from
/.../local/ompi-trunk/lib/libopen-pal.so.0
#9  0x00007fe8fd69bdf2 in event_process_active () from
/.../local/ompi-trunk/lib/libopen-pal.so.0
#10 0x00007fe8fd69c470 in opal_libevent2021_event_base_loop () from
/.../local/ompi-trunk/lib/libopen-pal.so.0
#11 0x00007fe8fd970a7b in orte_daemon () from
/.../local/ompi-trunk/lib/libopen-rte.so.0
#12 0x000000000040093a in main ()


the orted processes loop forever in event_process_active_single_queue;
mca_oob_tcp_component_hop_unknown gets called again and again:

mca_oob_tcp_component_hop_unknown (fd=-1, args=4, cbdata=0x99dc50) at
../../../../../../src/ompi-trunk/orte/mca/oob/tcp/oob_tcp_component.c:1369
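
to illustrate the pattern (this is *not* the orte code, just a minimal
libevent sketch; the callback name and the exact re-activation mechanism
are my assumption): a callback that keeps re-activating its own event
never lets event_base_loop() return, which is the symptom in the stack
traces above.

/* spin.c -- minimal libevent 2.x sketch (not the actual oob/tcp code):
 * a callback that re-activates its own event keeps event_base_loop()
 * busy forever */
#include <event2/event.h>
#include <stdio.h>

static struct event *ev;

static void hop_unknown_like_cb(evutil_socket_t fd, short what, void *arg)
{
    printf("callback fired again (fd=%d)\n", (int) fd);
    /* re-queue ourselves: the event base never runs out of active events */
    event_active(ev, EV_WRITE, 1);
}

int main(void)
{
    struct event_base *base = event_base_new();

    /* fd = -1: not tied to any file descriptor, just like the
     * hop_unknown frame above (fd=-1) */
    ev = event_new(base, -1, 0, hop_unknown_like_cb, NULL);
    event_active(ev, EV_WRITE, 1);   /* first activation */

    event_base_loop(base, 0);        /* never returns */

    event_free(ev);
    event_base_free(base);
    return 0;
}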

>
> > in another thread, Jeff mentioned that the usnic btl is doing stuff
> > even if there is no usnic hardware (this will be fixed shortly).
> > Do you still see an intermittent hang without listing usnic as a btl?
>
> Yeah, there's a definite race in the usnic BTL ATM.  If you care, here's
> what's happening:
>

thanks for the insights :-)

Cheers,

Gilles
