Hello Sean,
thank you very much. I have found the reason now.
Multicast routing/support was disabled in the OpenSM configuration file:
disable_multicast TRUE
If I had read the ibacm man page more carefully, I would have seen that
ibacm relies on multicast. My bad.
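
For anyone who wants to check for the same setting, here is a minimal sketch
that scans OpenSM configuration text for disable_multicast. It assumes the
usual "key value" line layout with "#" comments; the function name and the
sample text are only illustrative, and the location of the configuration file
depends on the installation, so the sketch takes the text as a string.

```python
def multicast_disabled(conf_text):
    """Return True/False for the last disable_multicast line, None if absent.

    Assumes one "key value" pair per line and "#" starting a comment.
    """
    result = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "disable_multicast":
            result = parts[1].upper() == "TRUE"
    return result


# Hypothetical excerpt with the setting that caused the problem here.
sample = """
# multicast settings
disable_multicast TRUE
"""
print(multicast_disabled(sample))
```

A result of True means multicast is switched off in that text, which is the
situation ibacm cannot work with.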
It's still a bit weird that ib_acme worked for the rest of the nodes, even
though multicast was disabled ;-)
We will try to use ibacm on a large installation over the weekend,
and I feel confident now that this will work and we can perform the benchmarks.
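
On a larger installation, the pdsh/ib_acme output of the kind quoted below
gets long, so a rough sketch for reducing it to the failing pairs may help.
It assumes the "host: Destination: ..." and "host: ib_acm_resolve_ip failed:
..." line format from that output, and it assumes that 10.1.4.1NN belongs to
node rc0NN (which matches the pdsh command and the rc011 log, but is specific
to this cluster). The function names are made up for this example.

```python
import re

DEST_RE = re.compile(r"^(\S+): Destination: (\S+)$")
FAIL_RE = re.compile(r"^(\S+): ib_acm_resolve_ip failed: (.+)$")


def failing_pairs(output):
    """Return the set of (source host, destination IP) pairs that failed."""
    last_dest = {}  # pdsh interleaves hosts, so track the destination per host
    pairs = set()
    for raw in output.splitlines():
        line = raw.strip()
        m = DEST_RE.match(line)
        if m:
            last_dest[m.group(1)] = m.group(2)
            continue
        m = FAIL_RE.match(line)
        if m and m.group(1) in last_dest:
            pairs.add((m.group(1), last_dest[m.group(1)]))
    return pairs


def ip_to_host(ip):
    """Assumed mapping for this cluster: 10.1.4.1NN -> rc0NN."""
    return "rc%03d" % (int(ip.rsplit(".", 1)[1]) - 100)


# Three of the failures from the output quoted below.
SAMPLE = """\
rc002: Destination: 10.1.4.106
rc002: ib_acm_resolve_ip failed: Connection timed out
rc011: Destination: 10.1.4.102
rc011: ib_acm_resolve_ip failed: Connection timed out
rc006: Destination: 10.1.4.111
rc006: ib_acm_resolve_ip failed: Connection timed out
"""

pairs = failing_pairs(SAMPLE)
hosts = {h for src, ip in pairs for h in (src, ip_to_host(ip))}
print(sorted(hosts))
```

If the set of hosts involved stays small while the rest of the fabric
resolves fine, that points to those nodes (or their ports) rather than to a
fabric-wide setting.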
Thanks again,
Jens
On Mar 23, 2013, at 4:59 AM, Hefty, Sean wrote:
>> Now I have another problem with 3 out of 18 nodes. All 3 get the correct
>> information for the other 15 nodes if I run ib_acme, and the other 15 can
>> also obtain the right information for the 3, but if I run ib_acme among
>> those 3 nodes then I get a "Connection timed out".
>> On all three nodes the command for 'localhost' works, too.
>>
>> Here is the output:
>> ====================================================================================
>> rc001 ~ $ pdsh -w rc0[00-17] 'for x in `seq 100 117`; do ib_acme -f i -d 10.1.4.${x} -v; done' | grep failed -B 1
>> rc002: Destination: 10.1.4.106
>> rc002: ib_acm_resolve_ip failed: Connection timed out
>> rc002: SA verification: failed Cannot assign requested address
>> --
>> rc011: Destination: 10.1.4.102
>> rc011: ib_acm_resolve_ip failed: Connection timed out
>> rc011: SA verification: failed Cannot assign requested address
>> --
>> rc006: Destination: 10.1.4.102
>> rc006: ib_acm_resolve_ip failed: Connection timed out
>> rc006: SA verification: failed Cannot assign requested address
>> --
>> rc002: Destination: 10.1.4.111
>> rc002: ib_acm_resolve_ip failed: Connection timed out
>> rc002: SA verification: failed Cannot assign requested address
>> --
>> rc011: Destination: 10.1.4.106
>> rc011: ib_acm_resolve_ip failed: Connection timed out
>> rc011: SA verification: failed Cannot assign requested address
>> --
>> rc006: Destination: 10.1.4.111
>> rc006: ib_acm_resolve_ip failed: Connection timed out
>> rc006: SA verification: failed Cannot assign requested address
>> ====================================================================================
>>
>> Have you seen this type of problem before? In this case it should not be
>> related to the ibacm_addr.cfg, right?
>> Maybe it's a problem with the switch or links; I will try some other ports of
>> the switch tomorrow.
>
> I have not seen this problem before. The log file that you provided looks
> okay to me.
>
> The following snippet from the rc011 log file indicates that the address
> resolution message sent from rc011 is correctly being routed back to rc011.
> (rc011 simply discards the message.)
>
> 1363971114.607: acm_process_recv: base endpoint name rc011
> 1363971114.607: acm_process_acm_recv:
> 1363971114.607: acm_process_acm_recv: src 10.1.4.111
> 1363971114.607: acm_process_acm_recv: dest 10.1.4.106
> 1363971114.607: acm_process_acm_recv: unsolicited request
> 1363971114.607: acm_process_addr_req:
> 1363971114.607: acm_acquire_dest: 10.1.4.111
> 1363971114.607: acm_get_dest: 10.1.4.111
> 1363971114.607: acm_process_addr_req: dest state 4
> 1363971114.607: acm_complete_queued_req: status 0
> 1363971114.607: acm_put_dest: 10.1.4.111
>
> What would be interesting to know is if the log file on rc006 shows that it
> received the message from rc011. That is, do we see something like this:
>
> : acm_process_recv: base endpoint name rc006
> : acm_process_acm_recv:
> : acm_process_acm_recv: src 10.1.4.111
> : acm_process_acm_recv: dest 10.1.4.106
> : acm_process_acm_recv: unsolicited request
>
> It's curious that only a select group of nodes can't communicate with each
> other. I'm inclined to agree with your assessment that it may be an issue
> with the switch, or possibly how the multicast group was configured.
>
> - Sean
--------------------------------
Dipl.-Math. Jens Domke
Research Assistant
Technische Universitaet Dresden
Center for Information Services and High Performance Computing (ZIH)
Interdisciplinary Application Development and Coordination
01062 Dresden
Tel.: +49 (351) 463-39114
Fax: +49 (351) 463-37773
E-Mail: [email protected]
--------------------------------
