On Fri, 2005-09-09 at 13:22, Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> > On Thu, 2005-09-08 at 09:02, Eitan Zahavi wrote:
> > 
> >>Hal Rosenstock wrote:
> >>
> >>>>>>>Sep 06 15:41:48 725691 [B76A4C40] -> __match_notice_to_inf_rec: ERR 
> >>>>>>>0207: Cannot find destination port with LID:0x0007
> >>>>>>
> >>>>>>This means that the LID of the port registered as the source for this 
> >>>>>>inform info is not recognized as a valid LID.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>...
> >>>>>>>Sep 06 15:41:48 726186 [B76A4C40] -> __match_notice_to_inf_rec: ERR 
> >>>>>>>0207: Cannot find source port with GUID:0x0008f10403960559
> >>>>>>
> >>>>>>The meaning of this is that the incoming trap source is not a 
> >>>>>>recognized (included in the SM database) guid
> >>>>>
> >>>>>
> > 
> >>>It looks like it occurs on SM port down which seems OK. 
> >>
> >>OK that explains it:
> >>The errors occur when the SM port has gone down. In that case all the ports
> >>that were previously found on the fabric are now inaccessible. The SM should
> >>Report(Notice with trap #65) for each of these ports.
> > 
> > 
> > Right, GID out of service should be and is indicated.
> > 
> > 
> >>To do that it scans through the InformInfo database.
> >>Apparently an InformInfo with LID=7 has requested this report.
> >>But LID 7 does not exist anymore
> > 
> > 
> > It exists. It is just not reachable via GS (SA) LID routed packets.
> Well, from the point of view of the SM it does not, once the SM cannot reach
> it.

OK.
 
> >> - so the first message is valid:
> > 
> > 
> > Not sure what you mean exactly by valid here.
> Valid means that it is correct. The destination port to send the Report to is 
> not part of any partition any more.
> I would rephrase the error message and make it Info. There is no ERROR in 
> losing some ports.

Right. This should be made into something less than error.
 
> >> > Sep 06 15:41:48 725691 [B76A4C40] -> __match_notice_to_inf_rec: ERR 
> >> > 0207: Cannot find destination port with LID:0x0007
> >>(actually this should have caused the InformInfo record to be deleted... 
> >>which I do not think is happening)
> > 
> > What should have caused the InformInfo record to be deleted ? 
> "o13-17.1.2: If a Set(InformInfo) specified a valid trap source at the time of
> subscription (see o13-14.1.1: on page 746), yet Trap() forwarding fails 
> because
> the subscriber and trap source are no longer permitted to access
> each other according to current partitioning (see o13-17.1.1: on page
> 747), then the manager shall permanently discontinue all event forwarding
> caused by the Set(InformInfo) which created a subscription to
> that trap source, except if InformInfo:LIDRangeBegin was 0xFFFF; in the
> latter case, event forwarding is discontinued only for the now-invalid trap
> source."
> Later on the same page:
> "Note also that “permanently discontinue all event forwarding” is meant to
> indicate that the subscription for forwarding is dropped by the manager; if
> the source later becomes reachable again by the subscriber, a new
> Set(InformInfo) is required to re-establish event forwarding, if that is what
> is desired. (This may not be desired; when the source becomes reachable
> again, it may have acquired new characteristics, such as new, different
> software functions, that make such forwarding inappropriate.)"
> 
> > This error being detected ? 
> Not currently
> > If so, should it wait for the error or should it occur
> > when the SM port goes down do this (clear the inform list perhaps with
> > the exception of the local node) ? 
> Maybe, or just write the generic code to handle o13-17.1.2
> > That would require/mean
> > reregistration is required when the node comes back. SA clients won't
> > necessarily do this when the SM port comes back without something like
> > ClientReregistration.
> Correct. This is another reason why ClientReRegistration is an important 
> feature of the
> access layer.

I would have ended that sentence after feature. It does not need to be
implemented in the access layer.
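
For reference, here is a rough sketch of how I read the o13-17.1.2 text quoted
above. This is not OpenSM code: subscription_t, on_partition_failure and
should_forward are hypothetical names, the fixed-size blocked list is only for
illustration, and LID range matching is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LID_RANGE_ALL 0xFFFF /* InformInfo:LIDRangeBegin wildcard value */
#define MAX_BLOCKED   8      /* arbitrary illustrative limit */

/* Hypothetical subscription record; NOT the real OpenSM layout. */
typedef struct {
    uint16_t lid_range_begin;          /* as given in Set(InformInfo) */
    bool     active;                   /* false once the subscription is dropped */
    uint16_t blocked_src[MAX_BLOCKED]; /* sources discontinued in the wildcard case */
    size_t   n_blocked;
} subscription_t;

/* Call when forwarding a trap from src_lid failed the partition check. */
static void on_partition_failure(subscription_t *sub, uint16_t src_lid)
{
    assert(sub != NULL);

    if (sub->lid_range_begin != LID_RANGE_ALL) {
        /* Specific trap source: discontinue all forwarding for this subscription. */
        sub->active = false;
        return;
    }

    /* Wildcard subscription: discontinue only the now-invalid trap source. */
    for (size_t i = 0; i < sub->n_blocked; i++) {
        if (sub->blocked_src[i] == src_lid)
            return; /* already recorded */
    }
    if (sub->n_blocked < MAX_BLOCKED)
        sub->blocked_src[sub->n_blocked++] = src_lid;
}

/* True if events from src_lid may still be forwarded to this subscriber.
   LID range matching is intentionally not modelled here. */
static bool should_forward(const subscription_t *sub, uint16_t src_lid)
{
    assert(sub != NULL);

    if (!sub->active)
        return false;
    for (size_t i = 0; i < sub->n_blocked; i++) {
        if (sub->blocked_src[i] == src_lid)
            return false;
    }
    return true;
}
```

Either way the subscriber gets nothing further for that source until it issues
a new Set(InformInfo), which is where something like ClientReregistration would
help.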

> >>Later we see the following error:
> >> > Sep 06 15:41:48 726186 [B76A4C40] -> __match_notice_to_inf_rec: ERR 
> >> > 0207: Cannot find source port with GUID:0x0008f10403960559
> >>This is sent during the section where node 0x0008f10403960559 is being 
> >>torn down from the SMDB.
> >>
> >>The code in osm_inform.c says:
> >>   /* Check if there is a pkey match. o13-17.1.1*/
> > 
> > 
> > Where is this performed ?
> osm_inform.c
> __match_notice_to_inf_rec
> 
> > 
> > 
> >>   /* Check if the issuer of the trap is the SM. If it is, then the pkey
> > 
> >                                                                       ^^
> >                                                                      gid
> The requirement is to have a shared PKey, according to the PKey sharing rules,
> between the InformInfo requester and the Trap generator. However, in the case
> of traps 64-67 the SM is the Trap generator. So we need the special logic below
> to obtain the port gid that the trap refers to from within the notice data
> details fields and not from the issuer field.

I think the comment in the code is wrong here and should be gid rather
than pkey. I do agree that the pkey sharing needs checking but that is
separate.
 
> >>      comparison should be done on the trap source (saved as the gid in the
> >>      data details field).
> >>      If the issuer gid is not the SM - then it is the guid of the trap
> >>      source. */
> >>   if ( (cl_ntoh64(p_ntc->issuer_gid.unicast.prefix) == 
> >> p_subn->opt.subnet_prefix) &&
> >>        (cl_ntoh64(p_ntc->issuer_gid.unicast.interface_id) == 
> >> p_subn->sm_port_guid) )
> >>   {
> >>     /* The issuer is the SM this is trap 64-67 - compare the pkey
> >>        with the gid saved on the data details. */
> >>     source_gid = p_ntc->data_details.ntc_64_67.gid;
> >>   }
> >>   else
> >>   {
> >>     source_gid = p_ntc->issuer_gid;
> >>   }
> >>
> >>In our case the trap is 65 and sent by the SM. However, the spec requires
> >>checking that the torn-down port and the target of the Report share a PKey.
> > 
> > 
> > I'm not sure what you are referring to in the spec. In any case,
> > shouldn't the local ports perhaps be an exception to this ?
> I do not think so. The requirement makes sense for all traps:
> If the Trap describes a port A then it should not be forwarded to another 
> port B unless they
> share a PKey:
> "o13-17.1.1: Managers that support event forwarding and have confirmed
> a request for event subscription shall forward corresponding events to the
> subscriber using a Report(Notice) MAD, as long as the subscriber and
> Trap() source are permitted to access each other according to current 
> partitioning."
> > 
> > 
> >> In our case the
> >>source of the event is considered to be the port that is being torn down (as
> >>we want to prevent any case where ports not sharing a PKey get reports on
> >>each other).
> >>But since the "source" port is being torn down we cannot find its PKey
> >>table ...
> >>(actually we look first in the Port by LID table - and cannot find it).
> >>
> >>This means we will never send Report(Notice trap#65) to any node.
> >>How do we solve that bug? Maybe we have a way to find the "source" port 
> >>PKey that
> >>is not yet corrupted.
> > 
> > 
> > I'm not totally following this because of the PKey v. GID issue above and 
> > I think local ports may be (needed to be) treated differently.
> I hope the above 17.1.1 convinced you. The GID vs PKey is just unclear 
> documentation.
> The idea is that for traps 64-67, which are generated by the SM, you cannot
> simply use the SM PKey but must look up the gid of the reported port from
> within the notice data details and then look up that port's PKey.

OK. I'm convinced.
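
To check that we mean the same thing by "share a PKey", here is a minimal
sketch of the comparison as I understand the partitioning rules: the low 15
bits must be equal and at least one side must be a full member (top bit set).
The names pkeys_match and ports_share_pkey are made up for this mail and are
not the OpenSM helpers; the tables would be those of the subscriber port and
of the port found via the gid in the notice data details.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PKEY_BASE_MASK   0x7FFF /* low 15 bits: partition key base value */
#define PKEY_FULL_MEMBER 0x8000 /* top bit: full (1) or limited (0) membership */

/* Two P_Keys match when their base values are equal and at least one of
   them is a full member; two limited members do not match. A base value
   of 0 is treated as invalid (my reading of the spec). */
static bool pkeys_match(uint16_t a, uint16_t b)
{
    if ((a & PKEY_BASE_MASK) == 0 || (b & PKEY_BASE_MASK) == 0)
        return false;
    if ((a & PKEY_BASE_MASK) != (b & PKEY_BASE_MASK))
        return false;
    return ((a | b) & PKEY_FULL_MEMBER) != 0;
}

/* True if any entry in tbl_a matches any entry in tbl_b. The tables are
   plain host-order arrays here, which is an assumption for illustration. */
static bool ports_share_pkey(const uint16_t *tbl_a, size_t n_a,
                             const uint16_t *tbl_b, size_t n_b)
{
    assert(tbl_a != NULL && tbl_b != NULL);

    for (size_t i = 0; i < n_a; i++) {
        for (size_t j = 0; j < n_b; j++) {
            if (pkeys_match(tbl_a[i], tbl_b[j]))
                return true;
        }
    }
    return false;
}
```

With this, a subscriber holding only 0x7FFF would not get a report about a port
that also holds only 0x7FFF, while one holding 0xFFFF would.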

I'm still not sure what bug you are referring to above, though.

-- Hal

_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general
