On 7/24/07, Eitan Zahavi <[EMAIL PROTECTED]> wrote:
*Hi Hal,*

*The code to find "duplicated" GUIDs stems from real user cases where a flawed burning procedure caused actual GUID duplications. There is nothing "impossible".*
No one said impossible; just a violation of what globally unique (the GU in GUID) really means. We are even discussing this aspect only because vendors allowed users to program non-volatile RAM for GUIDs, rather than using a real manufacturing process that guarantees uniqueness.

*So it is really critical that the SM will be able to recognize this case and abort.*
I agree with the detect part but not the abort part. Why can't it report these errors and continue on? That seems better to me than aborting.

-- Hal
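The report-and-continue behavior suggested above could look roughly like the following. This is only a sketch under stated assumptions: `check_link`, `process_link`, and the error counter are hypothetical names for illustration and are not OpenSM functions. The self-link condition mirrors the check quoted later in this thread (same node GUID, same port number, port number not 0).

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical result of inspecting one discovered link. */
typedef enum {
    LINK_OK,        /* link between two distinct ports */
    LINK_SELF_LOOP  /* the port sees itself as its own neighbor */
} link_check_t;

/* Same node GUID and same port number on both ends, and not port 0
 * (Enhanced Port 0 is the exception mentioned in the thread). */
static link_check_t check_link(uint64_t local_guid, uint8_t local_port,
                               uint64_t peer_guid, uint8_t peer_port)
{
    if (local_guid == peer_guid && local_port == peer_port && local_port != 0)
        return LINK_SELF_LOOP;
    return LINK_OK;
}

/* Report and continue: log the problem, count it, and tell the caller to
 * skip this link instead of aborting.
 * Returns 1 if the link should be used, 0 if it should be skipped. */
static int process_link(uint64_t local_guid, uint8_t local_port,
                        uint64_t peer_guid, uint8_t peer_port,
                        unsigned *p_errors)
{
    if (check_link(local_guid, local_port, peer_guid, peer_port) == LINK_SELF_LOOP) {
        fprintf(stderr,
                "ERR: duplicate GUID found by link from a port to itself: "
                "node 0x%016" PRIx64 ", port number 0x%X\n",
                local_guid, (unsigned)local_port);
        (*p_errors)++;
        return 0;
    }
    return 1;
}
```

A caller would loop over discovered links, call `process_link` for each, and decide at the end of the sweep what to do with a non-zero error count. Whether continuing is safe on a production fabric is exactly the point under discussion here, so the sketch takes no position on that.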
*It might be that for testing someone wants to use a loopback plug that causes the same port GUID to appear on both sides of the link, but it is better to require the user doing the test to set some flag than to miss such a situation in a real-life cluster.*

*This requirement was written after many people wasted many hours trying to figure out what was going on.*
*PLEASE DO NOT TAKE IT AWAY*

*Eitan Zahavi*
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

------------------------------
*From:* Hal Rosenstock [mailto:[EMAIL PROTECTED]]
*Sent:* Tuesday, July 24, 2007 6:04 PM
*To:* Eitan Zahavi
*Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
*Subject:* Re: OpenSM detection of duplicated GUIDs on loopback

On 7/24/07, Eitan Zahavi <[EMAIL PROTECTED]> wrote:
>
> *From:* Hal Rosenstock [mailto:[EMAIL PROTECTED]]
> *Sent:* Tuesday, July 24, 2007 5:53 PM
> *To:* Eitan Zahavi
> *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback
>
> Hi Eitan,
>
> On 7/24/07, Eitan Zahavi <[EMAIL PROTECTED]> wrote:
> >
> > *Hi Hal,*
> >
> > *What is this "loopback" connector used for?*
> > *Does not seem to me like a very useful thing to do.*
>
> Perhaps not but no reason OpenSM can't handle this more gracefully.
>
> > *Anyway, if it is not a production environment we could add a "debug
> > mode" (-d flag option) to ignore this check.*
>
> Why would a separate flag be needed?
>
> *[EZ] Since I do not see any other solution for the SM to know it is
> really a loopback plug rather than two devices with the same GUID connected
> back to back ...*

"Technically", this should only occur when looped back and not with two devices with the same GUID, as GUID == globally unique and a duplication indicates a "manufacturing" issue.
Anyhow, can't these be treated the same (and handled more gracefully) without an additional option/flag?

-- Hal

> -- Hal
>
> >
> > *Eitan Zahavi*
> > Senior Engineering Director, Software Architect
> > Mellanox Technologies LTD
> > Tel:+972-4-9097208
> > Fax:+972-4-9593245
> > P.O. Box 586 Yokneam 20692 ISRAEL
> >
> > ------------------------------
> > *From:* Hal Rosenstock [mailto:[EMAIL PROTECTED]]
> > *Sent:* Tuesday, July 24, 2007 5:31 PM
> > *To:* OpenFabrics General
> > *Cc:* Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik
> > *Subject:* OpenSM detection of duplicated GUIDs on loopback
> >
> > Hi,
> >
> > This is what starts off as a "minor" issue and I know it has been
> > discussed somewhat in the past:
> >
> > Putting a loopback connector on a (switch) link causes OpenSM to
> > indicate duplicated GUID error 0D18 as follows:
> >
> > __osm_ni_rcv_set_links
> > {
> > ...
> >     /*
> >        When there are only two nodes with exact same guids (connected back
> >        to back) - the previous check for duplicated guid will not catch
> >        them. But the link will be from the port to itself...
> >        Enhanced Port 0 is an exception to this
> >     */
> >     if ((osm_node_get_node_guid( p_node ) == p_ni_context->node_guid) &&
> >         (port_num == p_ni_context->port_num) &&
> >         (port_num != 0))
> >     {
> >       osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> >                "__osm_ni_rcv_set_links: ERR 0D18: "
> >                "Duplicate GUID found by link from a port to itself:"
> >                "node 0x%" PRIx64 ", port number 0x%X\n",
> >                cl_ntoh64( osm_node_get_node_guid( p_node ) ),
> >                port_num );
> > ...
> >
> > So this occurs over and over and over and fills the log with the same
> > spew. This should be improved IMO.
> >
> > Is this really a fatal condition? Doesn't seem like it should be to
> > me.
> >
> > Also, OpenSM can "ride" this out with -y (stay on fatal) but is that
> > safe for this condition?
> >
> > Seems like something like an extra loopback bit should be added to
> > some port structure which should cause these links to be ignored. This bit
> > would then be reset when the peer is no longer itself.
> >
> > Also, is there a relationship of this with the 12x/duplicated GUID
> > code?
> >
> > Thanks.
> >
> > -- Hal
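The "extra loopback bit" idea from the original message can be illustrated with a small sketch. Everything here is an assumption made for illustration: `sketch_port_t`, its `is_loopback` and `log_count` fields, and `port_note_peer` are hypothetical and do not correspond to actual OpenSM structures. The bit is set when a port's peer is the port itself, cleared when the peer changes, and a message is logged only on a state change, so repeated sweeps do not fill the log with the same line.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-port state; not an OpenSM structure. */
typedef struct {
    uint64_t node_guid;
    uint8_t  port_num;
    bool     is_loopback;  /* the proposed "extra loopback bit" */
    unsigned log_count;    /* messages emitted; kept only for this sketch */
} sketch_port_t;

/* Called once per sweep with the peer reported for this port.
 * Returns true when the link should be ignored (port is its own peer).
 * Logs only when the loopback state changes. */
static bool port_note_peer(sketch_port_t *p, uint64_t peer_guid, uint8_t peer_port)
{
    bool self = (p->node_guid == peer_guid) &&
                (p->port_num == peer_port) &&
                (p->port_num != 0);  /* Enhanced Port 0 is excluded */

    if (self && !p->is_loopback) {
        fprintf(stderr, "node 0x%016" PRIx64 ", port 0x%X: link to itself, ignoring\n",
                p->node_guid, (unsigned)p->port_num);
        p->log_count++;
    } else if (!self && p->is_loopback) {
        fprintf(stderr, "node 0x%016" PRIx64 ", port 0x%X: no longer linked to itself\n",
                p->node_guid, (unsigned)p->port_num);
        p->log_count++;
    }

    p->is_loopback = self;
    return self;
}
```

Note that this sketch cannot tell a loopback plug apart from two devices with the same GUID connected back to back, which is the objection raised in the replies; it only shows how the bit and the log suppression might work.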
