Hello Sean,

Thank you very much. I have found the cause now.
Multicast routing/support was disabled in the OpenSM configuration file:
        disable_multicast TRUE
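
For anyone who runs into the same symptom, a quick check along these lines may help. It is only a sketch: the function name is made up for illustration, and the location of the OpenSM configuration file differs between installations, so the path is passed in as an argument rather than hard-coded.

```shell
# Hypothetical helper: report whether an OpenSM config file disables multicast.
# Returns 0 if multicast is not disabled, 1 if it is, 2 if the file is unreadable.
check_opensm_multicast() {
    conf="$1"
    if [ ! -r "$conf" ]; then
        echo "cannot read $conf" >&2
        return 2
    fi
    # Look for an uncommented "disable_multicast TRUE" line (the option shown above).
    if grep -Eq '^[[:space:]]*disable_multicast[[:space:]]+TRUE([[:space:]]|$)' "$conf"; then
        echo "multicast disabled in $conf"
        return 1
    fi
    echo "multicast not disabled in $conf"
    return 0
}
```

It would be called on the subnet manager host with the path to the local configuration file, for example `check_opensm_multicast /etc/opensm/opensm.conf` (that path is an assumption and may not match your setup).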

If I had read the ibacm man page more carefully, I would have seen that
ibacm relies on multicast. My bad.
It's still a bit weird that ib_acme worked for the rest of the nodes, even
though multicast was disabled ;-)

We will try ibacm on a large installation over the weekend, and I feel
confident now that it will work and that we can run the benchmarks.
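
On a larger installation the raw ib_acme output gets long, so a small filter that reduces it to the failing source/destination pairs may be useful. The following is a rough sketch, assuming the pdsh-prefixed line format seen in the quoted output below ("<node>: Destination: <ip>" followed by "<node>: ib_acm_resolve_ip failed: ..."); the function name is hypothetical.

```shell
# Hypothetical helper: read pdsh-prefixed "ib_acme -v" output on stdin and
# print one "node -> destination" line per failed address resolution.
summarize_acm_failures() {
    awk '
        # Remember the most recent destination per node, since pdsh may
        # interleave lines coming from different nodes.
        $2 == "Destination:" { dest[$1] = $3 }
        $2 == "ib_acm_resolve_ip" && $3 == "failed:" {
            node = $1
            sub(/:$/, "", node)   # drop the trailing colon of the pdsh prefix
            print node " -> " dest[$1]
        }
    '
}
```

The output of the pdsh command from the quoted message could be piped into it, for instance followed by `sort | uniq -c` to see which pairs fail repeatedly.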

Thanks again,
Jens


On Mar 23, 2013, at 4:59 AM, Hefty, Sean wrote:

>> Now I have another problem with 3 out of 18 nodes. All 3 get the correct
>> information for the other 15 nodes if I run ib_acme, and the other 15 can
>> also obtain the right information for the 3, but if I run ib_acme among
>> those 3 nodes then I get a "Connection timed out".
>> On all three nodes the command for 'localhost' works, too.
>> 
>> Here is the output:
>> ====================================================================================
>> rc001 ~ $ pdsh -w rc0[00-17] 'for x in `seq 100 117`; do ib_acme -f i -d 10.1.4.${x} -v; done' | grep failed -B 1
>> rc002: Destination: 10.1.4.106
>> rc002: ib_acm_resolve_ip failed: Connection timed out
>> rc002: SA verification: failed Cannot assign requested address
>> --
>> rc011: Destination: 10.1.4.102
>> rc011: ib_acm_resolve_ip failed: Connection timed out
>> rc011: SA verification: failed Cannot assign requested address
>> --
>> rc006: Destination: 10.1.4.102
>> rc006: ib_acm_resolve_ip failed: Connection timed out
>> rc006: SA verification: failed Cannot assign requested address
>> --
>> rc002: Destination: 10.1.4.111
>> rc002: ib_acm_resolve_ip failed: Connection timed out
>> rc002: SA verification: failed Cannot assign requested address
>> --
>> rc011: Destination: 10.1.4.106
>> rc011: ib_acm_resolve_ip failed: Connection timed out
>> rc011: SA verification: failed Cannot assign requested address
>> --
>> rc006: Destination: 10.1.4.111
>> rc006: ib_acm_resolve_ip failed: Connection timed out
>> rc006: SA verification: failed Cannot assign requested address
>> ====================================================================================
>> 
>> Have you seen this type of problem before? In this case it should not be
>> related to the ibacm_addr.cfg, right?
>> Maybe it's a problem with the switch or the links. I will try some other
>> ports of the switch tomorrow.
> 
> I have not seen this problem before.  The log file that you provided looks 
> okay to me.
> 
> The following snippet from the rc011 log file indicates that the address
> resolution message sent from rc011 is correctly being routed back to rc011.
> (rc011 simply discards the message.)
> 
> 1363971114.607: acm_process_recv: base endpoint name rc011
> 1363971114.607: acm_process_acm_recv: 
> 1363971114.607: acm_process_acm_recv: src  10.1.4.111
> 1363971114.607: acm_process_acm_recv: dest 10.1.4.106
> 1363971114.607: acm_process_acm_recv: unsolicited request
> 1363971114.607: acm_process_addr_req: 
> 1363971114.607: acm_acquire_dest: 10.1.4.111
> 1363971114.607: acm_get_dest: 10.1.4.111
> 1363971114.607: acm_process_addr_req: dest state 4
> 1363971114.607: acm_complete_queued_req: status 0
> 1363971114.607: acm_put_dest: 10.1.4.111
> 
> What would be interesting to know is if the log file on rc006 shows that it 
> received the message from rc011.  That is, do we see something like this:
> 
> : acm_process_recv: base endpoint name rc006
> : acm_process_acm_recv: 
> : acm_process_acm_recv: src  10.1.4.111
> : acm_process_acm_recv: dest 10.1.4.106
> : acm_process_acm_recv: unsolicited request
> 
> It's curious that only a select group of nodes can't communicate with each 
> other.  I'm inclined to agree with your assessment that it may be an issue 
> with the switch, or possibly how the multicast group was configured.
> 
> - Sean

--------------------------------
Dipl.-Math. Jens Domke
Research Assistant

Technische Universitaet Dresden
Center for Information Services and High Performance Computing (ZIH)
Interdisciplinary Application Development and Coordination
01062 Dresden
Tel.: +49 (351) 463-39114
Fax: +49 (351) 463-37773
E-Mail: [email protected]
--------------------------------
