Craig Prescott wrote:

I forgot to add that other than this
SystemGUID=0x0000000000000000 issue, the HCA appears
to work perfectly.

Thanks,
Craig

Craig Prescott wrote:

Hi;

When we run 'ibdiagnet -r' on our OFED 1.2 cluster,
it bombs with a complaint about a system guid that is
zero on our only PCI-X HCA in the fabric (see appended).
ibdiagnet seems to be trying to saw off the leading zeroes
from the system guid, and to have nothing left afterwards
seems unexpected.

Running 'ibdiagnet -r' from an OFED 1.3.1 machine does
not bomb, but I am still concerned/unclear.

My questions are: is it ok to have an HCA running
around on your fabric with a system guid of zero?
What if there was more than one?  Is there any way to
assign this HCA a sensible system guid, and would it
be useful?

The HCA in question is a Cougar cub running the 3.5.0
firmware from Mellanox.  FWIW, the node and port guids
for this HCA look sensible:

[EMAIL PROTECTED] ~]# tvflash -g
HCA #0
Node  GUID = 0005ad0000050948
Port1 GUID = 0005ad0000050949
Port2 GUID = 0005ad000005094a
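
For anyone who wants to script this check across several hosts, here is a minimal Python sketch that parses text in the format shown above and flags any all-zero GUID. It is not part of tvflash. The names parse_guids and zero_guids are hypothetical, and the sample text is copied from the output above. That output has no system GUID line, so the sketch only covers the node and port GUIDs.

```python
import re

# Sample text copied from the `tvflash -g` output quoted in this message.
SAMPLE = """HCA #0
Node  GUID = 0005ad0000050948
Port1 GUID = 0005ad0000050949
Port2 GUID = 0005ad000005094a
"""

# Matches lines of the form "<Name> GUID = <16 hex digits>".
GUID_LINE = re.compile(r"^\s*(\w+)\s+GUID\s*=\s*([0-9a-fA-F]{16})\s*$")


def parse_guids(text):
    """Return a dict mapping the GUID label (e.g. 'Node') to its integer value."""
    guids = {}
    for line in text.splitlines():
        match = GUID_LINE.match(line)
        if match:
            guids[match.group(1)] = int(match.group(2), 16)
    return guids


def zero_guids(guids):
    """Return the labels whose GUID value is zero."""
    return [name for name, value in guids.items() if value == 0]


if __name__ == "__main__":
    parsed = parse_guids(SAMPLE)
    for name, value in parsed.items():
        print(f"{name}: 0x{value:016x}")
    print("zero GUIDs:", zero_guids(parsed) or "none")
```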

If it isn't obvious already, I confess I'm not clear
about how system guids are used.  From what I can gather
from google-ing around, a system guid of zero for an HCA
means that the HCA vendor simply did not assign one.  I
am under the impression that this is uncommon, but not
unheard of.  Is that correct?

I did some searches through both volumes of the 1.2.1 IB
spec and came up empty, but I could have easily missed any
substantial discussion about system guids.  Any pointers or
enlightenment in this area would be appreciated.

Thanks,
Craig Prescott
UF HPC Center

[EMAIL PROTECTED] ~]# ibdiagnet -r
Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2
Loading IBDM from: /usr/lib64/ibdm1.2
-W- Topology file is not specified.
    Reports regarding cluster links will use direct routes.
-I- Using port 1 as the local port.
-I- Discovering the subnet ... 394 nodes (46 Switches & 348 CA-s) discovered.

-I- Parsing Subnet file:/tmp/ibdiagnet.lst
-I- Defined 382/394 systems/nodes

-I---------------------------------------------------
-I- Bad Guids Info
-I---------------------------------------------------
-W- Found Device with SystemGUID=0x0000000000000000:
a HCA The Local Device "submit.ufhpc/P1" PortGUID=0x0005ad0000050949 at direct path=""
...
-I---------------------------------------------------
-I- mgid-mlid-HCAs matching table
-I---------------------------------------------------
mgid                                  | mlid   | HCAs
--------------------------------------------------------------------------------


ERROR can't use empty string as operand of "+"
    while executing
"if {([removeLeadingZeros $n] > [removeLeadingZeros $end] + 1)} {
         if {$start == $end} {
            append res "$end,"
         } else {
     ..."
    (procedure "groupNumRanges" line 15)
    invoked from within
"groupNumRanges $NEW_GROUPS($pNs)"
    (procedure "groupingEngine" line 24)
    invoked from within
"groupingEngine $groups"
    (procedure "compressNames" line 12)
    invoked from within
"compressNames $mlidHcas"
    (procedure "reportFabQualities" line 82)
    invoked from within
"reportFabQualities" can't use empty string as operand of "+"
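
The failing expression in the traceback adds 1 to the result of removeLeadingZeros, and Tcl complains that the operand is an empty string. That is what a helper returns if it strips every leading zero from an all-zero input. The Python sketch below only illustrates this failure mode and one possible guard. It is not ibdiagnet's Tcl code, and the function names are hypothetical.

```python
def strip_zeros_naive(digits: str) -> str:
    """Strip all leading zeros; an all-zero input collapses to ''."""
    return digits.lstrip("0")


def strip_zeros_guarded(digits: str) -> str:
    """Strip leading zeros but keep a single '0' for an all-zero input."""
    return digits.lstrip("0") or "0"


if __name__ == "__main__":
    for sample in ("0012", "0000"):
        print(
            repr(sample),
            "-> naive:", repr(strip_zeros_naive(sample)),
            "guarded:", repr(strip_zeros_guarded(sample)),
        )
```

With the guarded version, the result can always be converted to a number, so arithmetic such as "value + 1" no longer sees an empty operand.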



_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Hi.

- ibdiagnet 1.2 crashes when it encounters a zero system image GUID. As you can see, this is fixed in OFED 1.3.

- A system image GUID is used to identify nodes that belong to the same system. For HCAs, it is purely informational. For switches, it assists the SM in some advanced routing features. Bottom line: there is no "real" harm to the IB functionality if one or more HCAs have a system image GUID of 0.

- You can set the system image GUID using the mstflint tool (the Mellanox firmware burning tool). However, if you used tvflash to burn the HCA firmware, it is advisable to continue using tvflash (which I'm not familiar with). You can use mstflint to query the device firmware with no risk: run "mstflint -d mthca0 q".


Regards,
Oren.

