On 7/24/07, Eitan Zahavi <[EMAIL PROTECTED]> wrote:

 *Hi Hal,*

*The code to find "duplicated" GUIDs stems from real user cases where a
flawed burning procedure caused actual GUID duplications. There is nothing
"impossible".*


No one said impossible; just a violation of what globally unique (GU from
GUID) really means. It's largely because vendors allowed users to program
non-volatile RAM for GUIDs, rather than relying on a real manufacturing
process that guarantees uniqueness, that we are even discussing this aspect
of it.

*So it is really critical that the SM be able to recognize this case
and abort.*


I agree with the detect part but not the abort part. Why can't it report
these errors and continue on ? That seems better to me than aborting.
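
A minimal sketch of the report-and-continue idea, assuming a hypothetical
per-port state struct and helper (neither `struct port_state` nor
`note_peer` exists in OpenSM; they are illustrative names only). The
self-link test mirrors the quoted ERR 0D18 check. The error is logged once
when the condition first appears, the link is ignored while it persists,
and the flag clears when the peer is no longer the port itself:

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-port discovery state; not an actual OpenSM structure. */
struct port_state {
    uint64_t node_guid;
    uint8_t  port_num;
    bool     loopback;  /* set while the port's peer is the port itself */
};

/*
 * Record what a port sees at the far end of its link.
 * Returns true if the link should be used, false if it is a self-link
 * (loopback plug or duplicated GUID) that should be ignored.
 */
bool note_peer(struct port_state *p, uint64_t peer_guid,
               uint8_t peer_port, FILE *log)
{
    /* Same condition as the quoted check; port 0 is excluded. */
    bool self_link = (peer_guid == p->node_guid) &&
                     (peer_port == p->port_num) &&
                     (p->port_num != 0);

    if (!self_link) {
        p->loopback = false;  /* peer changed: reset the flag */
        return true;
    }

    if (!p->loopback) {
        /* First sighting: report once instead of on every sweep. */
        p->loopback = true;
        fprintf(log,
                "ERR: duplicate GUID or loopback on node 0x%016" PRIx64
                ", port %u; ignoring link\n",
                p->node_guid, (unsigned)p->port_num);
    }
    return false;
}
```

Whether ignoring the link is safe when the cause is a genuinely duplicated
GUID rather than a loopback plug is exactly the open question in this
thread; the sketch only shows that detection and reporting do not require
aborting.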

-- Hal


*It might be that for testing someone wants to use a loopback plug that
causes the same port GUID to appear on both sides of a link, but it is
better to require the user doing the test to set some flag than to miss
such a situation in a real-life cluster.*

*This requirement was written after many people wasted many hours trying
to figure out what was going on.*
*PLEASE DO NOT TAKE IT AWAY*

*Eitan Zahavi*
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


 ------------------------------
*From:* Hal Rosenstock [mailto:[EMAIL PROTECTED]
*Sent:* Tuesday, July 24, 2007 6:04 PM
*To:* Eitan Zahavi
*Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
*Subject:* Re: OpenSM detection of duplicated GUIDs on loopback




On 7/24/07, Eitan Zahavi <[EMAIL PROTECTED]> wrote:
>
>  *From:* Hal Rosenstock [mailto:[EMAIL PROTECTED] ]
> *Sent:* Tuesday, July 24, 2007 5:53 PM
> *To:* Eitan Zahavi
> *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback
>
>
>
> Hi Eitan,
>
> On 7/24/07, Eitan Zahavi <[EMAIL PROTECTED] > wrote:
> >
> >  *Hi Hal,*
> >
> > *What is this "loopback" connector used for?*
> > *Does not seem to me like a very useful thing to do.*
> >
>
> Perhaps not but no reason OpenSM can't handle this more gracefully.
>
> > *Anyway, if it is not a production environment we could add a "debug
> > mode" (-d flag option) to ignore this check.*
> >
>
> Why would a separate flag be needed ?
> *[EZ] Since I do not see any other way for the SM to know it is
> really a loopback plug rather than two devices with the same GUID
> connected back to back ...*
>
>
"Technically", this should only occur on a loopback and not with two
devices sharing the same GUID, since GUID == globally unique and a
duplication indicates a "manufacturing" issue.

Anyhow, can't these be treated the same (and handled more gracefully)
without an additional option/flag ?

-- Hal


> -- Hal
>
>
> >
> > *Eitan Zahavi*
> > Senior Engineering Director, Software Architect
> > Mellanox Technologies LTD
> > Tel:+972-4-9097208
> > Fax:+972-4-9593245
> > P.O. Box 586 Yokneam 20692 ISRAEL
> >
> >
> >  ------------------------------
> > *From:* Hal Rosenstock [mailto:[EMAIL PROTECTED]
> > *Sent: *Tuesday, July 24, 2007 5:31 PM
> > *To:* OpenFabrics General
> > *Cc:* Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik
> > *Subject:* OpenSM detection of duplicated GUIDs on loopback
> >
> >
> >  Hi,
> >
> > This starts off as a "minor" issue, and I know it has been
> > discussed somewhat in the past:
> >
> > Putting a loopback connector on a (switch) link causes OpenSM to
> > indicate duplicated GUID error 0D18 as follows:
> >
> > __osm_ni_rcv_set_links
> > {
> > ...
> >           /*
> >              When there are only two nodes with exact same guids (connected back
> >              to back) - the previous check for duplicated guid will not catch
> >              them. But the link will be from the port to itself...
> >              Enhanced Port 0 is an exception to this
> >           */
> >           if ((osm_node_get_node_guid( p_node ) == p_ni_context->node_guid) &&
> >               (port_num == p_ni_context->port_num) &&
> >               (port_num != 0))
> >           {
> >             osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> >                      "__osm_ni_rcv_set_links: ERR 0D18: "
> >                      "Duplicate GUID found by link from a port to itself:"
> >                      "node 0x%" PRIx64 ", port number 0x%X\n",
> >                      cl_ntoh64( osm_node_get_node_guid( p_node ) ),
> >                      port_num );
> > ...
> >
> > So this occurs over and over and over and fills the log with the same
> > spew. This should be improved IMO.
> >
> > Is this really a fatal condition ? Doesn't seem like it should be to
> > me.
> >
> > Also, OpenSM can "ride" this out with -y (stay on fatal) but is that
> > safe for this condition ?
> >
> > It seems an extra loopback bit should be added to some port
> > structure, which would cause these links to be ignored. This bit
> > would then be reset when the peer is no longer itself.
> >
> > Also, is there a relationship of this with the 12x/duplicated GUID
> > code ?
> >
> > Thanks.
> >
> > -- Hal
> >
> >
>

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
