On Tue, 2015-01-20 at 18:16 +0200, Erez Shitrit wrote:
> On 1/20/2015 5:58 AM, Doug Ledford wrote:
> > These patches are to resolve issues created by my previous patch set.
> > While that set worked fine in my testing, there were problems with
> > multicast joins after the initial set of joins had completed.  Since my
> > testing relied upon the normal set of multicast joins that happen
> > when the interface is first brought up, I missed those problems.
> >
> > Symptoms vary from failure to send packets due to a failed join, to
> > loss of connectivity after a subnet manager restart, to failure
> > to properly release multicast groups on shutdown resulting in hangs
> > when the mlx4 driver attempts to unload itself via its reboot
> > notifier handler.
> >
> > This set of patches has passed a number of tests above and beyond my
> > original tests.  As suggested by Or Gerlitz I added IPv6 and IPv4
> > multicast tests.  I also added both subnet manager restarts and
> > manual shutdown/restart of individual ports at the switch in order to
> > ensure that the ENETRESET path was properly exercised.  My test
> > sequence was: run traffic, restart the subnet manager, wait a quiescent
> > period for caches to expire, then restart traffic to make sure that arp
> > and neighbor discovery still work after the subnet manager restart.
> >
> > All in all, I have not been able to trip the multicast joins up any
> > longer.
> >
> > Additionally, the original impetus for my first 8-patch set was that
> > it was simply too easy to break the IPoIB subsystem with this simple
> > loop:
> >
> > while true; do
> >      ifconfig ib0 up
> >      ifconfig ib0 down
> > done
> >
> > Just to be safe, I made sure this problem did not resurface.
> >
> > Roland, the 3.19-rc code is broken.  We either need to revert my
> > original patchset, or grab these, but I would not recommend leaving
> > it as it currently stands.
> >
> > Doug Ledford (7):
> >    IB/ipoib: Fix failed multicast joins/sends
> >    IB/ipoib: Add a helper to restart the multicast task
> >    IB/ipoib: make delayed tasks not hold up everything
> >    IB/ipoib: Handle -ENETRESET properly in our callback
> >    IB/ipoib: don't restart our thread on ENETRESET
> >    IB/ipoib: remove unneeded locks
> >    IB/ipoib: fix race between mcast_dev_flush and mcast_join
> >
> >   drivers/infiniband/ulp/ipoib/ipoib.h           |   1 +
> >   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 204 +++++++++++++++----------
> >   2 files changed, 121 insertions(+), 84 deletions(-)
> >
> Hi Doug,
> 
> After trying your V4 patch series, I can confirm that, first, the endless
> scheduling of the mcast task is indeed gone,

Good.

> but the multicast functionality in ipoib is still unstable.

I'm not seeing that here.  Let's try to figure out what's different.

> I see that there are times when ping6 works well and times when it
> doesn't.  To make it clear, I always use the link-local address assigned
> by the stack to the IPoIB device; see [1] below for how I run it.

As do I.  I'll attach the scripts I used to run it for your reference.

> I also see that send-only mcast stops working from time to time; see [2]
> below for how I run this.  I can narrow the problem down to the sender
> (client) side, since I work against a peer node which has well-functioning
> IPoIB multicast code.

I don't think the peer side really makes for a conclusive argument ;-)

> One more phenomenon: in some cases I can see that the driver (after
> mcast_debug_level is set) prints this message endlessly:
> "ib0: no address vector, but multicast join already started"

OK, this is to be expected from your tests, I think.  In particular, this
message is generated by mcast_send() if it's called by your program while
the send-only join has not yet completed.  The flow goes like this:

First packet after interface comes up:
mcast_send() -> ipoib_mcast_alloc() -> ipoib_mcast_add() -> schedule join task thread

                                In a different thread:
                                mcast_join_task()
                                  find unjoined mcast group
                                  mark mcast->flags with IPOIB_MCAST_FLAG_BUSY
                                  -> mcast_join()
                                     send join request over the wire

Back on original thread context:
mcast_send()
  this time we find a matching mcast entry but mcast->ah is NULL
  queue the packet, unless the backlog is full, in which case drop it
  if mcast->flags & IPOIB_MCAST_FLAG_BUSY, emit the notice that you see

                                In a different thread:
                                mcast_sendonly_join_complete() ->
                                        mcast_join_finish()
                                          set mcast->ah
                                          send skb backlog queue
                                  clear IPOIB_MCAST_FLAG_BUSY

Back on original thread context:
mcast_send()
  now we find the mcast entry and a valid mcast->ah, so sends proceed
  as expected with no messages; any packets dropped while waiting for
  mcast->ah to become valid are simply gone
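
If it helps to see that queue-or-drop decision in one place, here's a toy,
single-threaded sketch of the pattern above.  This is not the actual ipoib
code: the names (toy_mcast, MAX_BACKLOG, etc.) are made up, and the real
driver spreads this across the threads shown above; it just models when
packets get queued, dropped, or flushed:

/*
 * Toy model of the queue-or-drop logic above -- NOT the real ipoib code.
 */
#include <stdio.h>
#include <stdbool.h>

#define MAX_BACKLOG 3	/* illustrative only */

struct toy_mcast {
	bool busy;	/* join request outstanding (the BUSY flag) */
	bool have_ah;	/* address handle valid, i.e. join finished */
	int  backlog;	/* packets queued while waiting for the join */
};

static void toy_send(struct toy_mcast *m, int pkt)
{
	if (m->have_ah) {
		printf("pkt %d: sent on the wire\n", pkt);
		return;
	}
	if (m->backlog < MAX_BACKLOG) {
		m->backlog++;
		printf("pkt %d: queued (backlog %d)\n", pkt, m->backlog);
	} else {
		printf("pkt %d: backlog full, dropped\n", pkt);
	}
	if (m->busy)
		printf("pkt %d: no address vector, but join already started\n", pkt);
}

/* roughly what the join-completion path does: set ah, flush, clear BUSY */
static void toy_join_finish(struct toy_mcast *m)
{
	m->have_ah = true;
	printf("join finished, flushing %d queued packets\n", m->backlog);
	m->backlog = 0;
	m->busy = false;
}

int main(void)
{
	struct toy_mcast m = { .busy = true };	/* join task already running */

	for (int i = 1; i <= 5; i++)
		toy_send(&m, i);		/* sender blasting away */
	toy_join_finish(&m);
	toy_send(&m, 6);			/* now goes straight out */
	return 0;
}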

This looks entirely normal to me if your application is busy blasting
packets while the join is happening.  Actually, to be honest, I think the
message is worthless.  I would be more interested in a message about
dropped packets than in a message that merely says we are sending
packets while the join is still in progress.

Unless we are sending so many packets out that we are starving the
join's ability to finish.  That would be interesting data to know.  Does
the join never finish in this case?  Also, I think you indicated that
you are running back-to-back without a switch?  These joins have to
go to the subnet manager and back.  What is your subnet management like?

> 
> One practical solution here would be to revert the offending 3.19-rc1
> commit 016d9fb ("IPoIB: fix MCAST_FLAG_BUSY usage").

It is not practical to revert that patch by itself.  That patch changes
the semantics of the mcast->flags usage in such a way that all of my
subsequent patches are broken without it.  They go as a group or not at
all.

> Thanks, Erez
> 
> [1] IPv6 ping
> 
> $ ping6 fe80::202:c903:9f:3b0a -I ib0
> where the IPv6 address is the one displayed by "ip addr show dev ib0" on 
> the remote node

Mine is similar.  I use these two files:
[root@rdma-master testing]$ cat ip6-addresses.txt 
rdma-master     fe80::f652:1403:7b:cba1 mlx4_ib0
rdma-perf-00    fe80::202:c903:31:7791  mlx4_ib0
rdma-perf-01    fe80::f652:1403:7b:e1b1 mlx4_ib0
rdma-perf-02    fe80::211:7500:77:d3cc  qib_ib0
rdma-perf-03    fe80::211:7500:77:d81a  qib_ib0
rdma-storage-01 fe80::f652:1403:7b:e131 mlx4_ib0
rdma-vr6-master fe80::601:1403:7b:cba1  mlx4_ib0
[root@rdma-master testing]$ cat ping_loop 
#!/bin/bash

# Exit cleanly on SIGINT/SIGTERM
trap_handler()
{
        exit 0
}

trap 'trap_handler' 2 15

ADDR_FILE=ip6-addresses.txt
ME=`hostname -s`
# Local interface name (third column of our own entry in the address file)
LOCAL=`awk '/'"$ME"'/ { print $3 }' $ADDR_FILE`
while true; do
        cat $ADDR_FILE | \
        while read host addr dev; do
                # Skip our own entry, ping everyone else's link-local address
                [ ${host} = $ME ] && continue
                ping6 ${addr}%$LOCAL -c 3
        done
done
[root@rdma-master testing]$ 


> 
> [2] IPv4 multicast
> 
> # server
> $ route add -net 224.0.0.0 netmask 240.0.0.0 dev ib0
> $ netserver
> 
> # client
> $ route add -net 224.0.0.0 netmask 240.0.0.0 dev ib0
> $ netperf -H 11.134.33.1 -t omni -- -H 225.5.5.4 -T udp -R 1

I've been using iperf with a slightly different setup:

Each machine is a server:
ip route add 224.0.0.0/4 dev <ib0 device>
iperf -usB 224.3.2.<server #> -i 1 > <host>-iperf-server.out &

Each machine rotates as a client:
iperf_loop &

[root@rdma-master testing]$ cat iperf-addresses.txt 
rdma-master     224.3.2.1
rdma-perf-00    224.3.2.2
rdma-perf-01    224.3.2.3
rdma-perf-02    224.3.2.4
rdma-perf-03    224.3.2.5
rdma-storage-01 224.3.2.6
[root@rdma-master testing]$ cat iperf_loop 
#!/bin/bash

ADDR_FILE=iperf-addresses.txt
ME=`hostname -s`
LOG=${ME}-iperf-client.out
# Truncate the per-host client log
> $LOG
while true; do
        cat $ADDR_FILE | \
        while read host addr ; do
                # Skip our own multicast group, send to everyone else's
                [ ${host} = $ME ] && continue
                iperf -uc ${addr} -i 1 >> $LOG
        done
done
[root@rdma-master testing]$ 

One of the differences between iperf and netperf is the speed with which
they blast the multicast packets out.  iperf sends them at a fairly
sane rate while netperf is balls to the wall.  So I don't see the kernel
messages you posted as a problem; they are simply telling you that
netperf is blasting away at the group while it is coming online.  Unless
they repeat endlessly on a single send-only group, which would indicate
that our join never completed.  If that's the case, we have to find out
why the join never completed.
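
If you want to rule the send rate in or out, one thing you could try is
varying iperf's UDP rate with its -b option and seeing whether the messages
track the rate.  The group address and rates below are just examples based
on my setup above:

# iperf's UDP default is about 1 Mbit/s; crank it up to approximate
# netperf's flat-out behavior, or keep it low to give the join room to finish
iperf -uc 224.3.2.4 -b 500M -i 1
iperf -uc 224.3.2.4 -b 1M -i 1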

-- 
Doug Ledford <[email protected]>
              GPG KeyID: 0E572FDD

