Hi Hal,
 
For many users, aborting on such a critical failure (one the SM cannot really
do anything about) is better than leaving it forgotten in some log file.
Anyway, the -y flag lets you ignore it if you like.
 

Eitan Zahavi 
Senior Engineering Director, Software Architect 
Mellanox Technologies LTD 
Tel:+972-4-9097208
Fax:+972-4-9593245 
P.O. Box 586 Yokneam 20692 ISRAEL 

 


________________________________

        From: Hal Rosenstock [mailto:[EMAIL PROTECTED] 
        Sent: Tuesday, July 24, 2007 9:38 PM
        To: Eitan Zahavi
        Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
        Subject: Re: OpenSM detection of duplicated GUIDs on loopback
        
        


        On 7/24/07, Eitan Zahavi <[EMAIL PROTECTED]> wrote: 

                Hi Hal,
                 
                The code to find "duplicated" GUIDs stems from real user cases
                where a flawed burning procedure caused actual GUID duplications.
                There is nothing "impossible".

         
        No one said impossible; just a violation of what globally unique
        (GU from GUID) really means. It's largely because vendors allowed users
        to program non-volatile RAM for GUIDs, rather than using a real
        manufacturing process that guarantees uniqueness, that we are even
        discussing this aspect of it.


                So it is really critical that the SM be able to recognize
                this case and abort.

         
        I agree with the detect part but not the abort part. Why can't it
        report these errors and continue on? That seems better to me than
        aborting.
         
        -- Hal


                 
                It might be that for testing someone wants to use a loopback
                plug that causes the same port GUID to appear on both sides of
                the link, but it is better to require the user doing the test
                to set some flag than to miss such a situation in a real-life
                cluster.
                 
                This requirement was written after many people wasted many
                hours trying to figure out what was going on.
                PLEASE DO NOT TAKE IT AWAY
                
                 


                 


________________________________

                        From: Hal Rosenstock [mailto:[EMAIL PROTECTED] ]
                        Sent: Tuesday, July 24, 2007 6:04 PM
                        To: Eitan Zahavi
                        Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
                        Subject: Re: OpenSM detection of duplicated GUIDs on loopback
                        

                         
                        


                        On 7/24/07, Eitan Zahavi <[EMAIL PROTECTED] > wrote:

                                From: Hal Rosenstock [mailto:[EMAIL PROTECTED] ]
                                Sent: Tuesday, July 24, 2007 5:53 PM
                                To: Eitan Zahavi
                                Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
                                Subject: Re: OpenSM detection of duplicated GUIDs on loopback
                                
                                 

                                Hi Eitan,
                                
                                
                                On 7/24/07, Eitan Zahavi <[EMAIL PROTECTED] > wrote:

                                Hi Hal,
                                 
                                What is this "loopback" connector used for?
                                It does not seem to me like a very useful thing to do.

                                 
                                Perhaps not, but there is no reason OpenSM can't
                                handle this more gracefully.


                                Anyway, if it is not a production environment we
                                could add a "debug mode" (-d flag option) to
                                ignore this check.

                                 
                                Why would a separate flag be needed?
                                [EZ] Since I do not see any other solution for
                                the SM to know it is really a loopback plug
                                rather than two devices with the same GUID
                                connected back to back ...

                         
                        "Technically", this should only occur when looped back
                        and not with two devices with the same GUID, as GUID ==
                        globally unique and a duplication indicates a
                        "manufacturing" issue.

                        Anyhow, can't these be treated the same (and handled
                        more gracefully) without an additional option/flag?
                         
                        -- Hal


                                
                                 
                                -- Hal


                                 


                                 


________________________________

                                From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
                                Sent: Tuesday, July 24, 2007 5:31 PM
                                To: OpenFabrics General
                                Cc: Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik
                                Subject: OpenSM detection of duplicated GUIDs on loopback
                                
                                 
                                
                                Hi,
                                 
                                This is what starts off as a "minor" issue and
                                I know it has been discussed somewhat in the
                                past:
                                 
                                Putting a loopback connector on a (switch) link
                                causes OpenSM to indicate duplicated GUID error
                                0D18 as follows:
                                
                                __osm_ni_rcv_set_links
                                {
                                ...
                                  /*
                                     When there are only two nodes with exact same guids
                                     (connected back to back) - the previous check for
                                     duplicated guid will not catch them. But the link
                                     will be from the port to itself...
                                     Enhanced Port 0 is an exception to this
                                  */
                                  if ((osm_node_get_node_guid( p_node ) == p_ni_context->node_guid) &&
                                      (port_num == p_ni_context->port_num) &&
                                      (port_num != 0))
                                  {
                                    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
                                             "__osm_ni_rcv_set_links: ERR 0D18: "
                                             "Duplicate GUID found by link from a port to itself:"
                                             "node 0x%" PRIx64 ", port number 0x%X\n",
                                             cl_ntoh64( osm_node_get_node_guid( p_node ) ),
                                             port_num );
                                ...
                                
                                So this occurs over and over and over and fills
                                the log with the same spew. This should be
                                improved IMO.
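
One way to cut the repeated spew, sketched under assumptions: remember per port that the self-link error has already been reported, and log it again only after the condition has cleared and come back. The struct and both function names below are hypothetical and are not part of OpenSM; the caller would still issue the actual osm_log() call when the helper returns true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping record, one per port; not an OpenSM structure. */
struct dup_guid_report {
    uint64_t node_guid;  /* node GUID in host byte order */
    uint8_t  port_num;   /* port on which the self-link was seen */
    bool     reported;   /* true once the error was logged for this port */
};

/*
 * Returns true if the caller should log the duplicate-GUID error now,
 * false if the same (GUID, port) pair was already reported.
 */
static bool should_log_self_link(struct dup_guid_report *r,
                                 uint64_t node_guid, uint8_t port_num)
{
    assert(r != NULL);
    if (r->reported && r->node_guid == node_guid && r->port_num == port_num)
        return false;            /* same condition as last time: stay quiet */
    r->node_guid = node_guid;
    r->port_num = port_num;
    r->reported = true;
    return true;
}

/* Call when the port is seen linked to a different peer, so that a later
 * recurrence of the self-link is logged again. */
static void clear_self_link_report(struct dup_guid_report *r)
{
    assert(r != NULL);
    r->reported = false;
}
```

This only changes how often the condition is reported; whether it should also be fatal is a separate decision.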
                                
                                Is this really a fatal condition? Doesn't seem
                                like it should be to me.

                                Also, OpenSM can "ride" this out with -y (stay
                                on fatal) but is that safe for this condition?
                                 
                                Seems like something like an extra loopback bit
                                should be added to some port structure, which
                                should cause these links to be ignored. This bit
                                would then be reset when the peer is no longer
                                itself.
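
A minimal sketch of the loopback-bit idea, assuming a simplified port record. The struct, its fields, and update_loopback() are illustrative names only and do not exist in OpenSM. The bit is set when the reported peer is the port itself and reset as soon as the peer is anything else; port 0 is excluded, mirroring the enhanced port 0 exception in the check quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical port record; field names are illustrative, not OpenSM's. */
struct port_state {
    uint64_t node_guid;    /* GUID of the node owning this port */
    uint8_t  port_num;     /* this port's number */
    bool     is_loopback;  /* the proposed extra bit */
};

/*
 * Recompute the loopback bit from the peer seen on this port's link.
 * Returns true if the link should be ignored (the peer is the port itself).
 */
static bool update_loopback(struct port_state *p,
                            uint64_t peer_guid, uint8_t peer_port)
{
    assert(p != NULL);
    p->is_loopback = (p->port_num != 0) &&
                     (peer_guid == p->node_guid) &&
                     (peer_port == p->port_num);
    return p->is_loopback;
}
```

Note that this test alone cannot tell a loopback plug from two devices with the same GUID connected back to back on the same port number; both look identical from the port's point of view.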
                                
                                Also, is there a relationship between this and
                                the 12x/duplicated GUID code?
                                 
                                Thanks.
                                 
                                -- Hal




_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
