On Fri, 2005-09-09 at 13:22, Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> > On Thu, 2005-09-08 at 09:02, Eitan Zahavi wrote:
> >
> >>Hal Rosenstock wrote:
> >>
> >>>>>>>Sep 06 15:41:48 725691 [B76A4C40] -> __match_notice_to_inf_rec: ERR
> >>>>>>>0207: Cannot find destination port with LID:0x0007
> >>>>>>
> >>>>>>This means that the LID of the port registered as the source for this
> >>>>>>inform info is not recognized as a valid LID.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>...
> >>>>>>>Sep 06 15:41:48 726186 [B76A4C40] -> __match_notice_to_inf_rec: ERR
> >>>>>>>0207: Cannot find source port with GUID:0x0008f10403960559
> >>>>>>
> >>>>>>The meaning of this is that the incoming trap source is not a
> >>>>>>recognized (included in the SM database) guid
> >>>>>
> >>>>>
> >
> >>>It looks like it occurs on SM port down which seems OK.
> >>
> >>OK that explains it:
> >>The errors occur when the SM port goes down. In that case all the ports
> >>that were previously found on the fabric are now inaccessible. The SM should
> >>send a Report(Notice with trap #65) for each of these ports.
> >
> >
> > Right, GID out of service should be and is indicated.
> >
> >
> >>To do that it scans through the InformInfo database.
> >>Apparently an InformInfo with LID=7 has requested this report.
> >>But LID 7 does not exist anymore
> >
> >
> > It exists. It is just not reachable via GS (SA) LID routed packets.
> Well, from the point of view of the SM it does not exist once the SM cannot
> reach it.
OK.
> >> - so the first message is valid:
> >
> >
> > Not sure what you mean exactly by valid here.
> Valid means that it is correct. The destination port to send the Report to is
> not part of any partition any more.
> I would rephrase the error message and make it Info. There is no ERROR in
> losing some ports.
Right. This should be made into something less than error.
> >> > Sep 06 15:41:48 725691 [B76A4C40] -> __match_notice_to_inf_rec: ERR
> >> > 0207: Cannot find destination port with LID:0x0007
> >>(actually this should have caused the InformInfo record to be deleted...
> >>which I do not think is happening)
> >
> > What should have caused the InformInfo record to be deleted ?
> "o13-17.1.2: If a Set(InformInfo) specified a valid trap source at the time of
> subscription (see o13-14.1.1: on page 746), yet Trap() forwarding fails
> because
> the subscriber and trap source are no longer permitted to access
> each other according to current partitioning (see o13-17.1.1: on page
> 747), then the manager shall permanently discontinue all event forwarding
> caused by the Set(InformInfo) which created a subscription to
> that trap source, except if InformInfo:LIDRangeBegin was 0xFFFF; in the
> latter case, event forwarding is discontinued only for the now-invalid trap
> source."
> Later on the same page:
> "Note also that “permanently discontinue all event forwarding” is meant to
> indicate that the subscription for forwarding is dropped by the manager; if
> the source later becomes reachable again by the subscriber, a new
> Set(InformInfo) is required to re-establish event forwarding, if that is what
> is desired. (This may not be desired; when the source becomes reachable
> again, it may have acquired new characteristics, such as new, different
> software functions, that make such forwarding inappropriate.)"
>
> > This error being detected ?
> Not currently
> > If so, should it wait for the error or should it occur
> > when the SM port goes down do this (clear the inform list perhaps with
> > the exception of the local node) ?
> Maybe, or just write generic code to handle o13-17.1.2
> > That would require/mean
> > reregistration is required when the node comes back. SA clients won't
> > necessarily do this when the SM port comes back without something like
> > ClientReregistration.
> Correct. This is another reason why ClientReRegistration is an important
> feature of the
> access layer.
I would have ended that sentence after feature. It does not need to be
implemented in the access layer.
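
As a rough illustration of the o13-17.1.2 text quoted above, the decision
after a failed forward could look like the sketch below. Everything here is
hypothetical (sketch_inform_t, sub_action_t and decide_after_forward() are not
OpenSM names); it only encodes the quoted rule, including the 0xFFFF
LIDRangeBegin exception, and says nothing about where in the SM such a check
would run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subscription record; not the OpenSM InformInfo type. */
typedef struct {
    uint16_t lid_range_begin;           /* InformInfo:LIDRangeBegin */
    bool     source_valid_at_subscribe; /* trap source was valid at Set() time */
} sketch_inform_t;

typedef enum {
    SUB_KEEP,             /* keep forwarding as before */
    SUB_DROP_SOURCE_ONLY, /* stop forwarding for this trap source only */
    SUB_DROP_ALL          /* drop the whole subscription */
} sub_action_t;

/* Decide what to do with a subscription after trying to forward a trap.
 * partition_allows is true when subscriber and trap source may still
 * access each other under current partitioning (o13-17.1.1). */
sub_action_t decide_after_forward(const sketch_inform_t *sub,
                                  bool partition_allows)
{
    assert(sub != NULL);

    if (partition_allows)
        return SUB_KEEP;

    /* The quoted rule only covers sources that were valid when the
     * subscription was made; other cases are left alone in this sketch. */
    if (!sub->source_valid_at_subscribe)
        return SUB_KEEP;

    /* Wildcard subscriptions lose only the now-invalid source. */
    if (sub->lid_range_begin == 0xFFFF)
        return SUB_DROP_SOURCE_ONLY;

    return SUB_DROP_ALL;
}
```

With the log above, a subscription made for LID 0x0007 alone would end up as
SUB_DROP_ALL once partitioning no longer permits access, which is why the
client would need to register again afterwards.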
> >>Later we see the following error:
> >> > Sep 06 15:41:48 726186 [B76A4C40] -> __match_notice_to_inf_rec: ERR
> >> > 0207: Cannot find source port with GUID:0x0008f10403960559
> >>This is sent during the section where node 0x0008f10403960559 is being
> >>removed from the SMDB.
> >>
> >>The code in osm_inform.c says:
> >> /* Check if there is a pkey match. o13-17.1.1*/
> >
> >
> > Where is this performed ?
> osm_inform.c
> __match_notice_to_inf_rec
>
> >
> >
> >> /* Check if the issuer of the trap is the SM. If it is, then the pkey
> >
> > ^^
> > gid
> The requirement is to have a shared PKey, according to the PKey sharing rules,
> between the InformInfo requester and the Trap generator. However, in the case
> of traps 64-67 the SM is the Trap generator. So we need the special logic below
> to obtain the port gid that the trap refers to from within the notice data
> details fields and not from the issuer field.
I think the comment in the code is wrong here and should be gid rather
than pkey. I do agree that the pkey sharing needs checking but that is
separate.
> >> comparison should be done on the trap source (saved as the gid in the
> >> data details field).
> >> If the issuer gid is not the SM - then it is the guid of the trap
> >> source. */
> >> if ( (cl_ntoh64(p_ntc->issuer_gid.unicast.prefix) ==
> >> p_subn->opt.subnet_prefix) &&
> >> (cl_ntoh64(p_ntc->issuer_gid.unicast.interface_id) ==
> >> p_subn->sm_port_guid) )
> >> {
> >> /* The issuer is the SM this is trap 64-67 - compare the pkey
> >> with the gid saved on the data details. */
> >> source_gid = p_ntc->data_details.ntc_64_67.gid;
> >> }
> >> else
> >> {
> >> source_gid = p_ntc->issuer_gid;
> >> }
> >>
> >>In our case the trap is 65 and is sent by the SM. However, the spec requires
> >>checking that the torn-down port and the target of the Report share a PKey.
> >
> >
> > I'm not sure what you are referring to in the spec. In any case,
> > shouldn't the local ports perhaps be an exception to this ?
> I do not think so. The requirement makes sense for all traps:
> If the Trap describes a port A then it should not be forwarded to another
> port B unless they share a PKey:
> "o13-17.1.1: Managers that support event forwarding and have confirmed
> a request for event subscription shall forward corresponding events to the
> subscriber using a Report(Notice) MAD, as long as the subscriber and
> Trap() source are permitted to access each other according to current
> partitioning."
> >
> >
> >>In our case the source of the event is considered to be the port that is
> >>torn down. (As we want to prevent any case where ports not sharing a PKey
> >>will get reports on each other).
> >>But since the "source" port is being torn down we can not find its PKey
> >>table ...
> >>(actually we look first in the Port by LID table - and can not find it).
> >>
> >>This means we will never send Report(Notice trap#65) to any node.
> >>How do we solve that bug? Maybe we have a way to find the "source" port
> >>PKey that
> >>is not yet corrupted.
> >
> >
> > I'm not totally following this because of the PKey v. GID issue above and
> > I think local ports may be (needed to be) treated differently.
> I hope the above 17.1.1 convinced you. The GID vs PKey is just unclear
> documentation.
> The idea is that for traps 64-67, which are generated by the SM, you cannot
> simply use the SM PKey; instead you look up the gid of the reported port from
> within the notice data details and then look up that port's PKey.
OK. I'm convinced.
I'm still not sure what is the bug you are referring to above though.
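
For reference, the lookup order discussed above can be sketched roughly as
follows. This is a minimal, self-contained illustration and not the
osm_inform.c code: the sketch_* types, notice_subject_gid() and
pkey_tables_share() are hypothetical names, and the PKey comparison assumes
the usual rule that two entries match when their low 15 bits are equal and at
least one of them has the full-membership bit set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; these are not the OpenSM types. */
typedef struct {
    uint64_t prefix;
    uint64_t interface_id;
} sketch_gid_t;

typedef struct {
    sketch_gid_t issuer_gid;  /* port that generated the notice */
    sketch_gid_t details_gid; /* GID in the data details (traps 64-67) */
} sketch_notice_t;

static bool gid_equal(sketch_gid_t a, sketch_gid_t b)
{
    return a.prefix == b.prefix && a.interface_id == b.interface_id;
}

/* Return the GID of the port the notice is about: the GID from the data
 * details when the SM issued the notice, otherwise the issuer itself. */
sketch_gid_t notice_subject_gid(const sketch_notice_t *ntc, sketch_gid_t sm_gid)
{
    assert(ntc != NULL);
    return gid_equal(ntc->issuer_gid, sm_gid) ? ntc->details_gid
                                              : ntc->issuer_gid;
}

/* Two PKey entries match when the 15-bit base is equal and non-zero and
 * at least one side is a full member (top bit set). */
static bool pkey_match(uint16_t a, uint16_t b)
{
    const uint16_t base_mask = 0x7FFF;
    const uint16_t full_bit = 0x8000;

    if ((a & base_mask) == 0 || (a & base_mask) != (b & base_mask))
        return false;
    return ((a | b) & full_bit) != 0;
}

/* True when any entry of table a matches any entry of table b. */
bool pkey_tables_share(const uint16_t *a, size_t na,
                       const uint16_t *b, size_t nb)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < na; i++)
        for (size_t j = 0; j < nb; j++)
            if (pkey_match(a[i], b[j]))
                return true;
    return false;
}
```

The sketch assumes the subject port's PKey table is still available when the
check runs; it does not address what to do once that port has already been
dropped from the SM database.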
-- Hal
_______________________________________________
openib-general mailing list
http://openib.org/mailman/listinfo/openib-general