Hi,

Yes, I am able to ping all the nodes connected to the InfiniBand switch. For more details, please see the attachments.
Thanks
Atul Yadav

On Sat, Apr 12, 2014 at 7:28 PM, Hal Rosenstock <[email protected]> wrote:

> On 4/12/2014 6:59 AM, Atul Yadav wrote:
> > Hi,
> >
> > Thanks for replying.
> > In this architecture, when we run ibv_rc_pingpong between two nodes
> > connected to the same switch, it succeeds. But when the two nodes are
> > on different switches, we get an error.
> >
> > Success:
> > [root@oss1 ~]# ibv_rc_pingpong
> > local address: LID 0x001e, QPN 0x2c004a, PSN 0x554863, GID ::
> > remote address: LID 0x0022, QPN 0x20004a, PSN 0x7c9dc2, GID ::
> > 8192000 bytes in 0.01 seconds = 6992.74 Mbit/sec
> > 1000 iters in 0.01 seconds = 9.37 usec/iter
> > [root@oss1 ~]#
> >
> > [root@mds1 ~]# ibv_rc_pingpong 173.16.1.52
> > local address: LID 0x0022, QPN 0x20004a, PSN 0x7c9dc2, GID ::
> > remote address: LID 0x001e, QPN 0x2c004a, PSN 0x554863, GID ::
> > 8192000 bytes in 0.01 seconds = 7084.97 Mbit/sec
> > 1000 iters in 0.01 seconds = 9.25 usec/iter
> > [root@mds1 ~]#
> >
> > Error:
> > [root@nalanda mvapich2-1.9]# ibv_rc_pingpong
> > local address: LID 0x0001, QPN 0x56004e, PSN 0x704d51
> > remote address: LID 0x0022, QPN 0x1c004a, PSN 0x07a0b2
> >
> > [root@mds1 ~]# ibv_rc_pingpong 173.16.1.1
> > local address: LID 0x0022, QPN 0x1c004a, PSN 0x07a0b2, GID ::
> > client read: Success
> > Couldn't read remote address
> > [root@mds1 ~]#
>
> Looking at libibverbs/examples/rc_pingpong.c:
>
> static struct pingpong_dest *pp_client_exch_dest(const char *servername, int port,
>                                                  const struct pingpong_dest *my_dest)
> {
> ...
>         gid_to_wire_gid(&my_dest->gid, gid);
>         sprintf(msg, "%04x:%06x:%06x:%s", my_dest->lid, my_dest->qpn,
>                 my_dest->psn, gid);
>         if (write(sockfd, msg, sizeof msg) != sizeof msg) {
>                 fprintf(stderr, "Couldn't send local address\n");
>                 goto out;
>         }
>
>         if (read(sockfd, msg, sizeof msg) != sizeof msg) {
>                 perror("client read");
>                 fprintf(stderr, "Couldn't read remote address\n");
>                 goto out;
>         }
>
> This read is failing for some reason.
> This is a message exchange over
> some IP network (for example, IPoIB or Ethernet).
>
> > And how do we test that our ftree topology is working fine?
> >
> > Please see the attachment.
>
> Looks like LIDs are assigned, but I can't tell about routing from the info
> supplied. The topology looks relatively simple (5 switches, homogeneous 4x
> QDR links). Is the OpenSM log clean? Any fat-tree related messages? This
> is likely not an SM issue.
>
> The next issues are end-node related (probably IPoIB configuration).
> Can you ping between the nodes that fail rc_pingpong? If not,
>
> -- Hal
>
> > Thank You
> > Atul Yadav
> >
> > On Sat, Apr 12, 2014 at 12:14 AM, Hal Rosenstock <[email protected]> wrote:
> >
> > > On 4/11/2014 2:21 PM, Atul Yadav wrote:
> > > > Dear Team,
> > > >
> > > > We are trying to build a fat-tree topology.
> > > > The details are given below:
> > > > Unmanaged 36-port switches, quantity: 5
> > > > According to a blog, we need to modify the opensm.conf file,
> > > > but we are unable to identify some parameters, such as:
> > > > root_guid_file ???????
> > >
> > > Fat tree routing will try to autodetect the roots, but this may not
> > > work, and it is better to specify the root GUIDs. In your case, they
> > > are the GUIDs for switches A and B.
> > >
> > > The root GUID file is then provided to OpenSM either via the conf
> > > file or command line parameters. The command line parameter is
> > > [-a | --root_guid_file <path to file>]
> > >
> > > The OpenSM man page says:
> > >
> > >   -a, --root_guid_file <file name>
> > >       Set the root nodes for the Up/Down or Fat-Tree routing algorithm
> > >       to the guids provided in the given file (one to a line).
> > >
> > > It also says:
> > >
> > >   If the root guid file is not provided ('-a' or '--root_guid_file'
> > >   options), the topology has to be pure fat-tree that complies with
> > >   the following rules:
> > >     - Tree rank should be between two and eight (inclusively)
> > >     - Switches of the same rank should have the same number
> > >       of UP-going port groups*, unless they are root switches,
> > >       in which case they shouldn't have UP-going ports at all.
> > >     - Switches of the same rank should have the same number
> > >       of DOWN-going port groups, unless they are leaf switches.
> > >     - Switches of the same rank should have the same number
> > >       of ports in each UP-going port group.
> > >     - Switches of the same rank should have the same number
> > >       of ports in each DOWN-going port group.
> > >     - All the CAs have to be at the same tree level (rank).
> > >
> > >   If the root guid file is provided, the topology doesn't have to be
> > >   pure fat-tree, and it should only comply with the following rules:
> > >     - Tree rank should be between two and eight (inclusively)
> > >     - All the Compute Nodes** have to be at the same tree level (rank).
> > >       Note that non-compute node CAs are allowed here to be at
> > >       different tree ranks.
> > >
> > >   * ports that are connected to the same remote switch are referenced
> > >     as port group.
> > >
> > >   ** list of compute nodes (CNs) can be specified by -u or
> > >      --cn_guid_file OpenSM options.
> > >
> > > -- Hal
> > >
> > > > Need your input on this.
> > > >
> > > > Thank You
> > > > Atul Yadav
> > > >
> > > > _______________________________________________
> > > > ewg mailing list
> > > > [email protected]
> > > > http://lists.openfabrics.org/mailman/listinfo/ewg
-------------------------------------------------
OpenSM 3.3.5
Reading Cached Option File: /etc/rdma/opensm.conf
Loading Cached Option:guid = 0x0002c9030042e421
Loading Cached Option:sweep_interval = 120
Loading Cached Option:routing_engine = ftree
Loading Cached Option:use_ucast_cache = TRUE
Loading Cached Option:root_guid_file = /etc/rdma/guid
Command Line Arguments:
 Daemon mode
 Log File: /var/log/opensm.log
-------------------------------------------------
OpenSM 3.3.5
Apr 12 20:44:15 982804 [67F8C700] 0x80 -> OpenSM 3.3.5
Entering DISCOVERING state
Apr 12 20:44:15 984514 [67F8C700] 0x02 -> osm_vendor_init: 1000 pending umads specified
Apr 12 20:44:15 984702 [67F8C700] 0x80 -> Entering DISCOVERING state
Entering MASTER state
Apr 12 20:44:15 984761 [67F8C700] 0x02 -> osm_vendor_bind: Binding to port 0x2c9030042e421
Apr 12 20:44:16 027506 [67F8C700] 0x02 -> osm_vendor_bind: Binding to port 0x2c9030042e421
Apr 12 20:44:16 027558 [67F8C700] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x0002c9030042e421
Apr 12 20:44:16 069014 [5CB78700] 0x80 -> Entering MASTER state
SUBNET UP
Apr 12 20:44:16 075363 [5CB78700] 0x02 -> fabric_dump_general_info: General fabric topology info
Apr 12 20:44:16 075368 [5CB78700] 0x02 -> fabric_dump_general_info: ============================
Apr 12 20:44:16 075371 [5CB78700] 0x02 -> fabric_dump_general_info: - FatTree rank (roots to leaf switches): 2
Apr 12 20:44:16 075372 [5CB78700] 0x02 -> fabric_dump_general_info: - FatTree max switch rank: 1
Apr 12 20:44:16 075374 [5CB78700] 0x02 -> fabric_dump_general_info: - Fabric has 39 CAs, 39 CA ports (39 of them CNs), 5 switches
Apr 12 20:44:16 075376 [5CB78700] 0x02 -> fabric_dump_general_info: - Fabric has 2 switches at rank 0 (roots)
Apr 12 20:44:16 075378 [5CB78700] 0x02 -> fabric_dump_general_info: - Fabric has 3 switches at rank 1 (3 of them leafs)
Apr 12 20:44:16 075511 [5CB78700] 0x02 -> osm_ucast_mgr_process: ftree tables configured on all switches
Apr 12 20:44:16 098151 [5CB78700] 0x80 -> SUBNET UP
Apr 12 20:44:16 277047 [6077E700] 0x01 -> log_trap_info: Received Generic Notice type:4 num:144 (CapabilityMask, NodeDescription, Link [Width|Speed] Enabled, SM priority changed) Producer:1 (Channel Adapter) from LID:1 TID:0x000000000000003c
Apr 12 20:44:16 277090 [6077E700] 0x02 -> trap_rcv_process_request: Trap 144 Node description update
Apr 12 20:44:16 277105 [6077E700] 0x02 -> log_notice: Reporting Generic Notice type:4 num:144 (CapabilityMask, NodeDescription, Link [Width|Speed] Enabled, SM priority changed) from LID:1 GID:fe80::2:c903:42:e421
Apr 12 20:44:16 298708 [5CB78700] 0x02 -> osm_ucast_cache_process: Configuring switch tables using cached routing
Apr 12 20:44:16 299811 [5CB78700] 0x02 -> SUBNET UP
Apr 12 20:44:18 072914 [61B80700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:401b:ffff::ffff:ffff
Apr 12 20:44:18 073968 [5E97B700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff85:6d01
Apr 12 20:44:18 074165 [5F37C700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:e421
Apr 12 20:44:18 074791 [64384700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:401b:ffff::1
Apr 12 20:44:18 074855 [67589700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff85:cc21
Apr 12 20:44:18 074905 [61B80700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff85:cc31
Apr 12 20:44:18 074940 [6117F700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff52:498f
Apr 12 20:44:18 075018 [5DF7A700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff52:4983
Apr 12 20:44:18 075062 [62F82700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:401b:ffff::fb
Apr 12 20:44:18 075126 [5DF7A700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff41:f801
Apr 12 20:44:18 075561 [5FD7D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff52:4993
Apr 12 20:44:18 075653 [62581700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1
Apr 12 20:44:18 076163 [5E97B700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:3a11
Apr 12 20:44:18 076192 [5E97B700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff52:499b
Apr 12 20:44:18 076299 [6117F700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:e301
Apr 12 20:44:18 076331 [6117F700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::202
Apr 12 20:44:18 076354 [6117F700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff41:f841
Apr 12 20:44:18 076419 [65786700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:e321
Apr 12 20:44:18 076851 [61B80700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:e331
Apr 12 20:44:18 076932 [62F82700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff85:cc61
Apr 12 20:44:18 077120 [62581700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:e441
Apr 12 20:44:18 077399 [66B88700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:e511
Apr 12 20:44:18 077635 [64D85700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:6
ibdiagnet-report.tar.gz
Description: GNU Zip compressed data
opensm.conf
Description: Binary data
