Hi Or,
Thanks a lot for your quick response.
The nodes have LIDs assigned to them and OpenSM is running fine. The LIDs
are missing from the test output apparently because the test does not
print those fields properly when RDMA_CM is used to establish
connections. I've attached the configurations of the two hosts to this
e-mail. As Jonathan mentioned, we are able to ping between them.
The issue is intermittent: sometimes it occurs, and at other times
everything works fine. Please let us know if you need any more information.
Thx,
Hari.
On Thu, 22 Jul 2010, Jonathan Perkins wrote:
> On Thu, Jul 22, 2010 at 3:15 AM, Or Gerlitz wrote:
> > Hari Subramoni wrote:
> >> [subra...@amd6 perftest]$ ./ib_rdma_bw -c 172.16.1.5
> >> 11928: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 |
> >> duplex=0 | cma=1 |
> >> 11928: Local address: LID 0000, QPN 000000, PSN 0x5bfbba RKey 0x90042602
> >> VAddr 0x002b27feabe000
> >> 11928: Remote address: LID 0000, QPN 000000, PSN 0x392fe6, RKey 0xf8042605
> >> VAddr 0x002b9d5c93b000
> >
> >
> > you can see the lid and qp numbers are zero, something is broken... when
> > you use the rdma-cm,
> > the address to be provided to the utility should be on an IPoIB subnet, is
> > that what you're doing?
> >
> > Basically, I would suggest that you first use rping(1) provided by
> > librdmacm-utils to make
> > sure things are working well in your configuration and then move to the
> > perftest utils.
>
> Thanks for the response Or. I'm posting some information below.
>
> Here is the output I get when running rping...
>
> [perki...@amd5 ~]$ rping -v -s -a 172.16.1.5
>
> [perki...@amd6 ~]$ rping -v -c -a 172.16.1.5
> cq completion failed status 5
> cma event RDMA_CM_EVENT_REJECTED, error 8
> wait for CONNECTED state 10
> connect error -1
> [perki...@amd6 ~]$ ping 172.16.1.5
> PING 172.16.1.5 (172.16.1.5) 56(84) bytes of data.
> 64 bytes from 172.16.1.5: icmp_seq=1 ttl=64 time=3.45 ms
> 64 bytes from 172.16.1.5: icmp_seq=2 ttl=64 time=1.00 ms
>
> We are able to ping the addresses but you can see that rping results
> in a failure.
>
> We have two interfaces exposed on each machine both on different
> subnets (172.16.1.0/24 and 172.16.2.0/24). We're using ofed-1.5.1 on
> these systems. Any idea of what could be going on?
>
> --
> Jonathan Perkins
>
[subra...@amd6 ~]$ ibstat
CA 'mlx4_0'
CA type: MT25418
Number of ports: 2
Firmware version: 2.6.0
Hardware version: a0
Node GUID: 0x0002c9030001e442
System image GUID: 0x0002c9030001e445
Port 1:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 4
LMC: 0
SM lid: 1
Capability mask: 0x02510868
Port GUID: 0x0002c9030001e443
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510868
Port GUID: 0x0002c9030001e444
CA 'mlx4_1'
CA type: MT25418
Number of ports: 2
Firmware version: 2.6.0
Hardware version: a0
Node GUID: 0x0002c9030001e44e
System image GUID: 0x0002c9030001e451
Port 1:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 6
LMC: 0
SM lid: 1
Capability mask: 0x02510868
Port GUID: 0x0002c9030001e44f
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510868
Port GUID: 0x0002c9030001e450
[subra...@amd6 ~]$
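To cross-check the ibstat output above, a minimal Python sketch along the following lines could list each port's state and base LID and flag any Active port whose LID is still 0. The names parse_ibstat and active_ports_without_lid are hypothetical helpers, not part of perftest or OFED, and SAMPLE is an abbreviated copy of the amd6 output.

```python
import re

# Abbreviated copy of the amd6 ibstat output (only the fields parsed below).
SAMPLE = """\
CA 'mlx4_0'
Port 1:
State: Active
Physical state: LinkUp
Base lid: 4
Port 2:
State: Down
Physical state: Polling
Base lid: 0
CA 'mlx4_1'
Port 1:
State: Active
Physical state: LinkUp
Base lid: 6
Port 2:
State: Down
Physical state: Polling
Base lid: 0
"""


def parse_ibstat(text):
    """Return a list of (ca, port, state, base_lid) tuples from ibstat text."""
    ports = []
    ca = port = state = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '(.+)'$", line)
        if m:
            ca = m.group(1)
            continue
        m = re.match(r"Port (\d+):$", line)
        if m:
            port = int(m.group(1))
            state = None
            continue
        # "Physical state:" does not match here because of the prefix check.
        if line.startswith("State:"):
            state = line.split(":", 1)[1].strip()
        elif line.startswith("Base lid:") and port is not None:
            lid = int(line.split(":", 1)[1].strip())
            ports.append((ca, port, state, lid))
    return ports


def active_ports_without_lid(ports):
    """Return the Active ports that have no LID assigned (base LID 0)."""
    return [p for p in ports if p[2] == "Active" and p[3] == 0]


if __name__ == "__main__":
    for ca, port, state, lid in parse_ibstat(SAMPLE):
        print(f"{ca} port {port}: state={state} lid={lid}")
    print("active ports without LID:",
          active_ports_without_lid(parse_ibstat(SAMPLE)))
```

With the amd6 data, both Active ports carry non-zero LIDs, which matches the statement that the subnet manager has assigned them.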
[subra...@amd6 ~]$ ifconfig
eth0 Link encap:Ethernet HWaddr 00:30:48:D0:19:CA
inet addr:164.107.119.237 Bcast:164.107.119.255 Mask:255.255.255.0
inet6 addr: fe80::230:48ff:fed0:19ca/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:132741 errors:0 dropped:0 overruns:0 frame:0
TX packets:51091 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:25346771 (24.1 MiB) TX bytes:18800740 (17.9 MiB)
Base address:0xbc00 Memory:d7fe0000-d8000000
ib0 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:172.16.1.6 Bcast:172.16.1.255 Mask:255.255.255.0
inet6 addr: fe80::202:c903:1:e443/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
RX packets:121 errors:0 dropped:0 overruns:0 frame:0
TX packets:66 errors:0 dropped:10 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:34885 (34.0 KiB) TX bytes:13913 (13.5 KiB)
ib2 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:172.16.2.6 Bcast:172.16.2.255 Mask:255.255.255.0
inet6 addr: fe80::202:c903:1:e44f/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
RX packets:76 errors:0 dropped:0 overruns:0 frame:0
TX packets:48 errors:0 dropped:10 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:24775 (24.1 KiB) TX bytes:15327 (14.9 KiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:7870 errors:0 dropped:0 overruns:0 frame:0
TX packets:7870 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:855574 (835.5 KiB) TX bytes:855574 (835.5 KiB)
virbr0 Link encap:Ethernet HWaddr 00:00:00:00:00:00
inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0
inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:28 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:6175 (6.0 KiB)
[subra...@amd6 ~]$
[subra...@amd6 ~]$ netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
172.16.2.0      0.0.0.0         255.255.255.0   U         0 0          0 ib2
164.107.119.0   0.0.0.0         255.255.255.0   U         0 0          0 eth0
172.16.1.0      0.0.0.0         255.255.255.0   U         0 0          0 ib0
192.168.122.0   0.0.0.0         255.255.255.0   U         0 0          0 virbr0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eth0
0.0.0.0         164.107.119.1   0.0.0.0         UG        0 0          0 eth0
[subra...@amd6 ~]$
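On the question of whether the address given to rping and ib_rdma_bw is on an IPoIB subnet, a small Python sketch such as the one below can map a target address to the local interface whose subnet contains it. The INTERFACES table and the interface_for helper are illustrative assumptions; the addresses and netmasks are copied from the amd6 ifconfig output above. It only matches directly connected subnets and is not a full routing lookup.

```python
import ipaddress

# Interface addresses and netmasks copied from the amd6 ifconfig output.
INTERFACES = {
    "eth0": ("164.107.119.237", "255.255.255.0"),
    "ib0": ("172.16.1.6", "255.255.255.0"),
    "ib2": ("172.16.2.6", "255.255.255.0"),
}


def interface_for(target, interfaces=INTERFACES):
    """Return the name of the interface whose subnet contains target, or None."""
    addr = ipaddress.ip_address(target)
    for name, (ip, mask) in interfaces.items():
        # strict=False lets us pass a host address together with its netmask.
        network = ipaddress.ip_network(f"{ip}/{mask}", strict=False)
        if addr in network:
            return name
    return None


if __name__ == "__main__":
    for target in ("172.16.1.5", "172.16.2.5"):
        print(target, "->", interface_for(target))
```

For the address used in the failing rping run, 172.16.1.5, this resolves to ib0, so the target is on one of the IPoIB subnets.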
[subra...@amd6 ~]$ ping 172.16.1.5
PING 172.16.1.5 (172.16.1.5) 56(84) bytes of data.
64 bytes from 172.16.1.5: icmp_seq=1 ttl=64 time=2.31 ms
64 bytes from 172.16.1.5: icmp_seq=2 ttl=64 time=0.109 ms
64 bytes from 172.16.1.5: icmp_seq=3 ttl=64 time=0.078 ms
--- 172.16.1.5 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.078/0.834/2.315/1.047 ms
[subra...@amd6 ~]$ ping 172.16.1.6
PING 172.16.1.6 (172.16.1.6) 56(84) bytes of data.
64 bytes from 172.16.1.6: icmp_seq=1 ttl=64 time=0.046 ms
64 bytes from 172.16.1.6: icmp_seq=2 ttl=64 time=0.013 ms
64 bytes from 172.16.1.6: icmp_seq=3 ttl=64 time=0.014 ms
--- 172.16.1.6 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.013/0.024/0.046/0.015 ms
[subra...@amd6 ~]$
[subra...@amd5 ~]$ ibstat
CA 'mlx4_0'
CA type: MT25418
Number of ports: 2
Firmware version: 2.6.0
Hardware version: a0
Node GUID: 0x0002c9030001e386
System image GUID: 0x0002c9030001e389
Port 1:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 12
LMC: 0
SM lid: 1
Capability mask: 0x02510868
Port GUID: 0x0002c9030001e387
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510868
Port GUID: 0x0002c9030001e388
CA 'mlx4_1'
CA type: MT25418
Number of ports: 2
Firmware version: 2.6.0
Hardware version: a0
Node GUID: 0x0002c9030001e452
System image GUID: 0x0002c9030001e455
Port 1:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 7
LMC: 0
SM lid: 1
Capability mask: 0x02510868
Port GUID: 0x0002c9030001e453
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510868
Port GUID: 0x0002c9030001e454
[subra...@amd5 ~]$ ifconfig
eth0 Link encap:Ethernet HWaddr 00:30:48:D0:19:BE
inet addr:164.107.119.236 Bcast:164.107.119.255 Mask:255.255.255.0
inet6 addr: fe80::230:48ff:fed0:19be/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:238196 errors:0 dropped:0 overruns:0 frame:0
TX packets:172491 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:62341710 (59.4 MiB) TX bytes:94768875 (90.3 MiB)
Base address:0xbc00 Memory:d7fe0000-d8000000
ib0 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:172.16.1.5 Bcast:172.16.1.255 Mask:255.255.255.0
inet6 addr: fe80::202:c903:1:e387/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
RX packets:121 errors:0 dropped:0 overruns:0 frame:0
TX packets:78 errors:0 dropped:13 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:47421 (46.3 KiB) TX bytes:21533 (21.0 KiB)
ib2 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:172.16.2.5 Bcast:172.16.2.255 Mask:255.255.255.0
inet6 addr: fe80::202:c903:1:e453/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
RX packets:120 errors:0 dropped:0 overruns:0 frame:0
TX packets:45 errors:0 dropped:13 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:40365 (39.4 KiB) TX bytes:13567 (13.2 KiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:8506 errors:0 dropped:0 overruns:0 frame:0
TX packets:8506 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:941478 (919.4 KiB) TX bytes:941478 (919.4 KiB)
virbr0 Link encap:Ethernet HWaddr 00:00:00:00:00:00
inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0
inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:38 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:7969 (7.7 KiB)
[subra...@amd5 ~]$
[subra...@amd5 ~]$
[subra...@amd5 ~]$ ping 172.16.1.6
PING 172.16.1.6 (172.16.1.6) 56(84) bytes of data.
64 bytes from 172.16.1.6: icmp_seq=1 ttl=64 time=2.23 ms
64 bytes from 172.16.1.6: icmp_seq=2 ttl=64 time=0.111 ms
--- 172.16.1.6 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.111/1.172/2.234/1.062 ms
[subra...@amd5 ~]$ ping 172.16.2.6
PING 172.16.2.6 (172.16.2.6) 56(84) bytes of data.
64 bytes from 172.16.2.6: icmp_seq=1 ttl=64 time=1.70 ms
64 bytes from 172.16.2.6: icmp_seq=2 ttl=64 time=0.104 ms
64 bytes from 172.16.2.6: icmp_seq=3 ttl=64 time=0.083 ms
--- 172.16.2.6 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.083/0.631/1.707/0.760 ms
[subra...@amd5 ~]$
[subra...@amd5 ~]$ netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
172.16.2.0      0.0.0.0         255.255.255.0   U         0 0          0 ib2
164.107.119.0   0.0.0.0         255.255.255.0   U         0 0          0 eth0
172.16.1.0      0.0.0.0         255.255.255.0   U         0 0          0 ib0
192.168.122.0   0.0.0.0         255.255.255.0   U         0 0          0 virbr0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eth0
0.0.0.0         164.107.119.1   0.0.0.0         UG        0 0          0 eth0
[subra...@amd5 ~]$