Hello Sean,

That hint was really the missing piece of the puzzle. Somehow the IP addresses
were missing from some of the configuration files, and after deleting the file
ibacm_addr.cfg I was able to run ibacm properly.
Thank you very much for the help.

Now I have another problem with 3 out of 18 nodes (rc002, rc006 and rc011).
All 3 get the correct information for the other 15 nodes when I run ib_acme,
and the other 15 can obtain the right information for the 3, but when I run
ib_acme among those 3 nodes I get "Connection timed out".
On all three nodes the command for 'localhost' works as well.

Here is the output:
====================================================================================
rc001 ~ $ pdsh -w rc0[00-17] 'for x in `seq 100 117`; do ib_acme -f i -d 10.1.4.${x} -v; done' | grep failed -B 1
rc002: Destination: 10.1.4.106
rc002: ib_acm_resolve_ip failed: Connection timed out
rc002: SA verification: failed Cannot assign requested address
--
rc011: Destination: 10.1.4.102
rc011: ib_acm_resolve_ip failed: Connection timed out
rc011: SA verification: failed Cannot assign requested address
--
rc006: Destination: 10.1.4.102
rc006: ib_acm_resolve_ip failed: Connection timed out
rc006: SA verification: failed Cannot assign requested address
--
rc002: Destination: 10.1.4.111
rc002: ib_acm_resolve_ip failed: Connection timed out
rc002: SA verification: failed Cannot assign requested address
--
rc011: Destination: 10.1.4.106
rc011: ib_acm_resolve_ip failed: Connection timed out
rc011: SA verification: failed Cannot assign requested address
--
rc006: Destination: 10.1.4.111
rc006: ib_acm_resolve_ip failed: Connection timed out
rc006: SA verification: failed Cannot assign requested address
====================================================================================
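
To make the output above easier to read, it can be condensed to
"node -> destination" pairs with a small filter. This is only a sketch: the
function name failing_pairs is made up for illustration, and it assumes the
pdsh/ib_acme line format shown above.

```shell
# Hypothetical helper: condense "pdsh ... | grep failed -B 1" output into
# one "node -> destination" line per failed ib_acm_resolve_ip call.
failing_pairs() {
    awk '
        # pdsh may interleave lines from different nodes, so remember
        # the last destination separately for each node prefix.
        $2 == "Destination:" {
            node = $1; sub(/:$/, "", node)
            dest[node] = $3
            next
        }
        # Report the remembered destination when the resolve step fails.
        $2 == "ib_acm_resolve_ip" && $3 == "failed:" {
            node = $1; sub(/:$/, "", node)
            print node " -> " dest[node]
        }
    ' | LC_ALL=C sort
}
```

Appending "| failing_pairs" to the pipeline above should list six pairs, all
of them among rc002, rc006 and rc011.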

Have you seen this type of problem before? In this case it should not be
related to the ibacm_addr.cfg, right?
Maybe it's a problem with the switch or the links. I will try some other ports
of the switch tomorrow.

Attached is the log file from rc011 (10.1.4.111) trying to get the information
for rc006 (10.1.4.106), in case you want to take a look.

Regards,
Jens

PS: I have a second rail running on the second port of the HCAs with a similar
setup, and I'm able to run ib_acme for all 18 nodes on the second rail without
trouble.
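
The per-node check from your hint quoted below (same source and destination
address) could be scripted roughly like this for the three nodes. It is a
dry-run sketch that only prints the commands. The function name
self_test_cmds is made up, and the rc0NN -> 10.1.4.1NN mapping is an
assumption taken from the output above.

```shell
# Hypothetical helper: print one pdsh command per node that runs ib_acme
# with the node's own address as both source and destination.
# Assumes node rc0NN owns address 10.1.4.1NN on this rail.
self_test_cmds() {
    for node in "$@"; do
        addr="10.1.4.1${node#rc0}"   # rc011 -> 10.1.4.111
        printf "pdsh -w %s 'ib_acme -f i -s %s -d %s -v'\n" \
            "$node" "$addr" "$addr"
    done
}

# Print the commands for the three affected nodes; pipe to sh to run them.
self_test_cmds rc002 rc006 rc011
```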


On Mar 22, 2013, at 6:37 AM, Hefty, Sean wrote:

> Note that you can test each node separately by making the source/destination 
> addresses the same.  This may show that your first system, rc002, is working, 
> but rc003 is not.
> 
>> On the second node, the ib_acme command fails only for IPs, too. But it
>> returns with a different message ('Cannot assign requested address'):
>> ===============================================================================
>> rc003 ~/tmp/ibacm-1.0.7 $ ib_acme -f i -s 10.0.0.52 -d 10.0.0.51 -v -P -V
>> Service: localhost
>> Destination: 10.0.0.51
>> Source: 10.0.0.52
>> ib_acm_resolve_ip failed: Cannot assign requested address
>> SA verification: failed Cannot assign requested address
>> 
>> Error Count,Resolve Count,No Data,Addr Query Count,Addr Cache Count,Route Query Count,Route Cache Count
>> localhost,1,2,0,0,0,0,0
>> return status 0x0
>> 
>> rc003 ~/ $ cat /var/log/ibacm.log
>> ...
>> 1363872021.460: acm_svr_accept:
>> 1363872021.460: acm_svr_accept: assigned client 0
>> 1363872021.460: acm_server: receiving from client 0
>> 1363872021.460: acm_svr_receive: client 0
>> 1363872021.460: acm_svr_resolve_dest: client 0
>> 1363872021.460: acm_svr_resolve_dest: src  10.0.0.52
>> 1363872021.460: acm_get_ep: 10.0.0.52
>> 1363872021.460: acm_get_ep: notice - could not find 10.0.0.52
> 
> It doesn't appear that the ibacm address information is correct.  Having the 
> complete log file may help.  The assigned address configuration would end up 
> being near the top of the log file.
> 
> ibacm uses an address file, ibacm_addr.cfg, to assign address information to 
> ports.  If this file is not present, it will be created.  It's a text file, 
> and the format is hopefully straightforward to follow.  As a couple of places 
> to look, the file may be in:
> 
> /etc/rdma/ibacm_addr.cfg
> /usr/local/etc/rdma/ibacm_addr.cfg
> 
> If you find the file, the simplest thing to do may be to just remove it.  You 
> can look at the existing file to see that the correct IP address has been 
> assigned to the right port.
> 
> - Sean

Attachment: ibacm.log
Description: Binary data

_______________________________________________
ewg mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
