Hal Rosenstock wrote:
Sep 06 15:41:48 725691 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: 
Cannot find destination port with LID:0x0007

This means that the LID of the port registered as the source of this
InformInfo is not recognized as a valid LID.


...
Sep 06 15:41:48 726186 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: 
Cannot find source port with GUID:0x0008f10403960559

This means that the incoming trap source is not a GUID recognized by the SM
(it is not included in the SM database).



It looks like it occurs on SM port down, which seems OK.
OK, that explains it:
The errors occur when the SM port goes down. In that case, all the ports
that were previously found on the fabric are now inaccessible. The SM should
send a Report(Notice with trap #65) for each of these ports.
To do that, it scans through the InformInfo database.
Apparently an InformInfo with LID=7 has requested this report.
But LID 7 does not exist anymore, so the first message is valid:
> Sep 06 15:41:48 725691 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: 
Cannot find destination port with LID:0x0007
(Actually, this should have caused the InformInfo record to be deleted, which
I do not think is happening.)
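
One way to picture the cleanup mentioned here: when a subscriber LID no longer
resolves to a port, its InformInfo record could be dropped during the scan. The
following is only a minimal sketch in C under that assumption. The names
struct inf_rec, lid_is_known and prune_stale_inf_recs are hypothetical and are
not the OpenSM data structures or API.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for an InformInfo record. */
struct inf_rec {
    uint16_t subscriber_lid;  /* LID that registered the InformInfo */
    bool     valid;           /* set to false once the record is dropped */
};

/* Return true if lid appears in the table of currently known port LIDs. */
static bool lid_is_known(const uint16_t *lids, size_t n_lids, uint16_t lid)
{
    for (size_t i = 0; i < n_lids; i++)
        if (lids[i] == lid)
            return true;
    return false;
}

/* Invalidate every record whose subscriber LID no longer exists.
   Returns the number of records dropped by this call. */
static size_t prune_stale_inf_recs(struct inf_rec *recs, size_t n_recs,
                                   const uint16_t *lids, size_t n_lids)
{
    size_t dropped = 0;

    for (size_t i = 0; i < n_recs; i++) {
        if (recs[i].valid &&
            !lid_is_known(lids, n_lids, recs[i].subscriber_lid)) {
            recs[i].valid = false;
            dropped++;
        }
    }
    return dropped;
}
```

With a record registered from LID 0x0007 and a LID table that no longer
contains 0x0007, the call would mark that one record invalid and leave the
others untouched.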

Later we see the following error:
> Sep 06 15:41:48 726186 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: 
Cannot find source port with GUID:0x0008f10403960559
This is logged during the section where port 0x0008f10403960559 is being torn
out of the SMDB.

The code in osm_inform.c says:
  /* Check if there is a pkey match. o13-17.1.1*/
  /* Check if the issuer of the trap is the SM. If it is, then the pkey
     comparison should be done on the trap source (saved as the gid in the
     data details field).
     If the issuer gid is not the SM - then it is the guid of the trap
     source. */
  if ( (cl_ntoh64(p_ntc->issuer_gid.unicast.prefix) == p_subn->opt.subnet_prefix) &&
       (cl_ntoh64(p_ntc->issuer_gid.unicast.interface_id) == p_subn->sm_port_guid) )
  {
    /* The issuer is the SM this is trap 64-67 - compare the pkey
       with the gid saved on the data details. */
    source_gid = p_ntc->data_details.ntc_64_67.gid;
  }
  else
  {
    source_gid = p_ntc->issuer_gid;
  }

In our case the trap is 65 and is sent by the SM. However, the spec requires us
to check that the port being torn down and the target of the Report share a
PKey. In our case the source of the event is considered to be the port that is
torn down (as we want to prevent any case where ports not sharing a PKey get
reports about each other).
But since the "source" port is being torn down, we cannot find its PKey
table ...
(actually we look first in the port-by-LID table, and cannot find it).

This means we will never send Report(Notice trap#65) to any node.
How do we solve that bug? Maybe we have a way to find the "source" port PKey
that is not yet corrupted.
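
One possible direction, offered as an assumption and not as the actual OpenSM
fix, would be to copy the departing port's PKey table before the port is
removed and to run the PKey comparison against that copy. The sketch below
shows only the comparison step in C. The names struct pkey_snapshot,
pkeys_match and snapshots_share_pkey are hypothetical. The matching rule
(equal low 15 bits, at least one side a full member) follows InfiniBand
partition membership semantics; treating 0 as an unused slot is a convention
of this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

#define PKEY_BASE_MASK   0x7FFFu  /* low 15 bits: partition number */
#define PKEY_FULL_MEMBER 0x8000u  /* high bit set: full membership */
#define MAX_PKEYS        8        /* arbitrary table size for this sketch */

/* Hypothetical copy of a port's PKey table, taken before the port is removed. */
struct pkey_snapshot {
    uint16_t pkeys[MAX_PKEYS];
    size_t   count;
};

/* Two PKeys match when their base values are equal and at least one of
   them carries full membership. A base of 0 marks an unused slot here. */
static bool pkeys_match(uint16_t a, uint16_t b)
{
    if ((a & PKEY_BASE_MASK) == 0 || (b & PKEY_BASE_MASK) == 0)
        return false;
    if ((a & PKEY_BASE_MASK) != (b & PKEY_BASE_MASK))
        return false;
    return ((a | b) & PKEY_FULL_MEMBER) != 0;
}

/* Return true if any PKey in snapshot a matches any PKey in snapshot b. */
static bool snapshots_share_pkey(const struct pkey_snapshot *a,
                                 const struct pkey_snapshot *b)
{
    for (size_t i = 0; i < a->count; i++)
        for (size_t j = 0; j < b->count; j++)
            if (pkeys_match(a->pkeys[i], b->pkeys[j]))
                return true;
    return false;
}
```

The idea would be to fill a snapshot for the "source" port before it is
unlinked, then pass that snapshot together with the subscriber's table to
snapshots_share_pkey when matching the trap 65 notice, so the check no longer
depends on the port still being in the database.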

Here's an extract of that portion of the log:

Sep 06 15:41:48 724961 [B76A4C40] -> __osm_state_mgr_is_sm_port_down: ]
Sep 06 15:41:48 724980 [0000] -> SM port is down.
Sep 06 15:41:48 724980 [B76A4C40] -> SM port is down.
Sep 06 15:41:48 725261 [B76A4C40] -> __osm_state_mgr_sm_port_down_msg:


******************************************************************
************************** SM PORT DOWN **************************
******************************************************************


Sep 06 15:41:48 725283 [B76A4C40] -> osm_drop_mgr_process: [
Sep 06 15:41:48 725303 [B76A4C40] -> osm_drop_mgr_process: Checking node 
0x0008f1040396040c.
Sep 06 15:41:48 725324 [B76A4C40] -> __osm_drop_mgr_process_node: [
Sep 06 15:41:48 725342 [B76A4C40] -> __osm_drop_mgr_process_node: Unreachable 
node 0x0008f1040396040c.
Sep 06 15:41:48 725364 [B76A4C40] -> __osm_drop_mgr_remove_port: [
Sep 06 15:41:48 725383 [B76A4C40] -> __osm_drop_mgr_remove_port: Unreachable 
port 0x0008f1040396040e.
Sep 06 15:41:48 725417 [B76A4C40] -> __osm_drop_mgr_remove_port: Clearing 
abandoned LID range [0x7,0x7].
Sep 06 15:41:48 725480 [B76A4C40] -> __osm_drop_mgr_remove_port: Unlinking 
local node 0x0008f1040396040c, port 0x2
                                and remote node 0x0008f10403960558, port 0x1.
Sep 06 15:41:48 725504 [B76A4C40] -> __osm_drop_mgr_remove_port: resetting 
discovery count of node: 0x0008f10403960558 port num:1.
Sep 06 15:41:48 725525 [B76A4C40] -> __osm_drop_mgr_remove_port: Clearing 
physical port number 2.
Sep 06 15:41:48 725563 [B76A4C40] -> osm_report_notice: [
Sep 06 15:41:48 725583 [B76A4C40] -> osm_report_notice: Reporting Generic 
Notice type:3 num:65 from LID:0x0003 GID:0xfe80000000000000,0x0008f10403960559
Sep 06 15:41:48 725612 [B76A4C40] -> __match_notice_to_inf_rec: [
Sep 06 15:41:48 725632 [B76A4C40] -> __match_notice_to_inf_rec: Mismatch by 
Node Type: II=0x000003 Trap=0x000004
Sep 06 15:41:48 725653 [B76A4C40] -> __match_notice_to_inf_rec: ]
Sep 06 15:41:48 725671 [B76A4C40] -> __match_notice_to_inf_rec: [
Sep 06 15:41:48 725691 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: 
Cannot find destination port with LID:0x0007
Sep 06 15:41:48 725710 [B76A4C40] -> __match_notice_to_inf_rec: ]
Sep 06 15:41:48 725728 [B76A4C40] -> __match_notice_to_inf_rec: [
Sep 06 15:41:48 725747 [B76A4C40] -> __match_notice_to_inf_rec: Mismatch by 
Node Type: II=0x000001 Trap=0x000004
Sep 06 15:41:48 725767 [B76A4C40] -> __match_notice_to_inf_rec: ]
Sep 06 15:41:48 725785 [B76A4C40] -> __match_notice_to_inf_rec: [
Sep 06 15:41:48 725804 [B76A4C40] -> __match_notice_to_inf_rec: Mismatch by 
Node Type: II=0x000002 Trap=0x000004
Sep 06 15:41:48 725823 [B76A4C40] -> __match_notice_to_inf_rec: ]
Sep 06 15:41:48 725843 [B76A4C40] -> osm_report_notice: ]
Sep 06 15:41:48 725862 [B76A4C40] -> Removed port with GUID:0x0008f1040396040e 
LID range [0x7,0x7] of node:Voltaire HCA400
Sep 06 15:41:48 725883 [B76A4C40] -> __osm_drop_mgr_remove_port: ]
Sep 06 15:41:48 725904 [B76A4C40] -> __osm_drop_mgr_process_node: ]
Sep 06 15:41:48 725923 [B76A4C40] -> osm_drop_mgr_process: Checking node 
0x0008f10403960558.
Sep 06 15:41:48 725943 [B76A4C40] -> osm_drop_mgr_process: Checking full 
discovery of node 0x0008f10403960558.
Sep 06 15:41:48 725964 [B76A4C40] -> osm_drop_mgr_process: Checking port 
0x0008f10403960559.
Sep 06 15:41:48 725984 [B76A4C40] -> __osm_drop_mgr_remove_port: [
Sep 06 15:41:48 726002 [B76A4C40] -> __osm_drop_mgr_remove_port: Unreachable 
port 0x0008f10403960559.
Sep 06 15:41:48 726023 [B76A4C40] -> __osm_drop_mgr_remove_port: Clearing 
abandoned LID range [0x3,0x3].
Sep 06 15:41:48 726043 [B76A4C40] -> __osm_drop_mgr_remove_port: Clearing 
physical port number 1.
Sep 06 15:41:48 726067 [B76A4C40] -> osm_report_notice: [
Sep 06 15:41:48 726086 [B76A4C40] -> osm_report_notice: Reporting Generic 
Notice type:3 num:65 from LID:0x0003 GID:0xfe80000000000000,0x0008f10403960559
Sep 06 15:41:48 726110 [B76A4C40] -> __match_notice_to_inf_rec: [
Sep 06 15:41:48 726129 [B76A4C40] -> __match_notice_to_inf_rec: Mismatch by 
Node Type: II=0x000003 Trap=0x000004
Sep 06 15:41:48 726149 [B76A4C40] -> __match_notice_to_inf_rec: ]
Sep 06 15:41:48 726167 [B76A4C40] -> __match_notice_to_inf_rec: [
Sep 06 15:41:48 726186 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: 
Cannot find source port with GUID:0x0008f10403960559
Sep 06 15:41:48 726206 [B76A4C40] -> __match_notice_to_inf_rec: ]
Sep 06 15:41:48 726225 [B76A4C40] -> __match_notice_to_inf_rec: [
Sep 06 15:41:48 726243 [B76A4C40] -> __match_notice_to_inf_rec: Mismatch by 
Node Type: II=0x000001 Trap=0x000004
Sep 06 15:41:48 726263 [B76A4C40] -> __match_notice_to_inf_rec: ]
Sep 06 15:41:48 726281 [B76A4C40] -> __match_notice_to_inf_rec: [
Sep 06 15:41:48 726300 [B76A4C40] -> __match_notice_to_inf_rec: Mismatch by 
Node Type: II=0x000002 Trap=0x000004
Sep 06 15:41:48 726319 [B76A4C40] -> __match_notice_to_inf_rec: ]
Sep 06 15:41:48 726339 [B76A4C40] -> osm_report_notice: ]
Sep 06 15:41:48 726357 [B76A4C40] -> Removed port with GUID:0x0008f10403960559 
LID range [0x3,0x3] of node:MT23108 InfiniHost Mellanox Technologies
Sep 06 15:41:48 726378 [B76A4C40] -> __osm_drop_mgr_remove_port: ]
Sep 06 15:41:48 726426 [B76A4C40] -> osm_drop_mgr_process: ]


Then if you can send us the log file it will help.


I'll send you the whole log offline if you still want it.
No, no need to.
