On Thu, 15 Dec 2011 14:20:28 -0800 Hal Rosenstock <[email protected]> wrote:
> On 12/15/2011 12:49 PM, Ira Weiny wrote:
> > On Thu, 15 Dec 2011 06:15:17 -0800
> > Hal Rosenstock <[email protected]> wrote:
> >
> >> On 12/14/2011 10:18 PM, Ira Weiny wrote:
> >>>
> >>> In addition print transaction ID of all DR PATH dumps to make sure we know
> >>> which MAD's they refer to.
> >>
> >> A note on this approach is that this splits the logging of send errors
> >> between the vendor layer and SM rather than keeping it all at one layer
> >> of the implementation. That's the tradeoff to not fixing the bug in
> >> umad_receiver in terms of printing the DR path in ERR 5411.
> >
> > Yes I guess it could be viewed this way but I really thought of it more as
> > adding to the already existing logging in sm_mad_ctrl_send_err_cb and
> > fixing a bug in the logging of umad_receiver.
> >
> > As I responded in the other thread I did not take out any logging in
> > umad_receiver which I think is still valid.
>
> You left some redundant logging in though and that was what I was
> commenting on here.
>
> > In addition I just added logging
> > in the error callback regarding the request which timed out.
> >
> >>
> >>> Signed-off-by: Ira Weiny <[email protected]>
> >>> ---
> >>>  libvendor/osm_vendor_ibumad.c |    2 --
> >>>  opensm/osm_helper.c           |    5 +++--
> >>>  opensm/osm_sm_mad_ctrl.c      |   16 ++++++++++++++--
> >>>  3 files changed, 17 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/libvendor/osm_vendor_ibumad.c b/libvendor/osm_vendor_ibumad.c
> >>> index e2ebd8e..b2872c8 100644
> >>> --- a/libvendor/osm_vendor_ibumad.c
> >>> +++ b/libvendor/osm_vendor_ibumad.c
> >>> @@ -348,8 +348,6 @@ static void *umad_receiver(void *p_ptr)
> >>>                                  ", Hop Ptr: 0x%X\n",
> >>>                                  mad->method, cl_ntoh16(mad->attr_id),
> >>>                                  cl_ntoh64(mad->trans_id), smp->hop_ptr);
> >>> -                        osm_dump_smp_dr_path(p_vend->p_log, smp,
> >>> -                                             OSM_LOG_ERROR);
> >>
> >> If you're going this direction, why not remove the logging of error 5411
> >> above it which means eliminate the else clause there ?
> >> Isn't that
> >> redundant with your change below to sm_mad_ctrl_send_err_cb ?
> >
> > Technically, yes it is redundant as the "response" is not really a response.
> > (I think.) But my intention was not to remove any logging except that which
> > was "useless".
>
> If this logging is moved up to the callback, then it should be removed
> from here IMO (and also takes care of the cancelled case too at least
> for SMPs).

Ok.

> >>
> >> Also, shouldn't another related change to umad_receiver be done:
> >>
> >> Where it is:
> >>         if (mad->mgmt_class != IB_MCLASS_SUBN_DIR) {
> >> it should now be:
> >>         if ((mad->mgmt_class != IB_MCLASS_SUBN_DIR) &&
> >>             (mad->mgmt_class != IB_MCLASS_SUBN_LID)) {
> >>
> >> to go along with SM class being logged in the SM send_err callback
> >> rather than at umad layer.
> >
> > I am not sure I follow here.
>
> Since the callback is made for both DR and LR SMPs, the logging at the
> vendor layer isn't needed for those. It's still needed for GMPs though
> (like PerfMgr).

The PerfMgr prints the address info in its error callback. The SA, however,
does not. :-( So I will add it there.

> > Why would you care about the other classes which
> > timeout? Wouldn't they have the same issue of a response which is "fake"?
>
> No; the issue is only with DR path. Isn't LID fine ?

Yep, it would be.

> > If we want to remove the logging at this layer I think we should consider
> > this.
> >
> > diff --git a/libvendor/osm_vendor_ibumad.c b/libvendor/osm_vendor_ibumad.c
> > index b2872c8..b352cef 100644
> > --- a/libvendor/osm_vendor_ibumad.c
> > +++ b/libvendor/osm_vendor_ibumad.c
> > @@ -327,29 +327,6 @@ static void *umad_receiver(void *p_ptr)
> >                  /* if status != 0 then we are handling recv timeout on send */
> >                  if (umad_status(p_madw->vend_wrap.umad)) {
> >
> > -                        if (mad->mgmt_class != IB_MCLASS_SUBN_DIR) {
> > -                                /* LID routed */
> > -                                OSM_LOG(p_vend->p_log, OSM_LOG_ERROR, "ERR 5410: "
> > -                                        "Send completed with error -- dropping\n"
> > -                                        "\t\t\tClass 0x%x, Method 0x%X, Attr 0x%X, "
> > -                                        "TID 0x%" PRIx64 ", LID %u\n",
> > -                                        mad->mgmt_class, mad->method,
> > -                                        cl_ntoh16(mad->attr_id),
> > -                                        cl_ntoh64(mad->trans_id),
> > -                                        cl_ntoh16(ib_mad_addr->lid));
> > -                        } else {
> > -                                ib_smp_t *smp;
> > -
> > -                                /* Direct routed SMP */
> > -                                smp = (ib_smp_t *) mad;
> > -                                OSM_LOG(p_vend->p_log, OSM_LOG_ERROR, "ERR 5411: "
> > -                                        "DR SMP Send completed with error -- dropping\n"
> > -                                        "\t\t\tMethod 0x%X, Attr 0x%X, TID 0x%" PRIx64
> > -                                        ", Hop Ptr: 0x%X\n",
> > -                                        mad->method, cl_ntoh16(mad->attr_id),
> > -                                        cl_ntoh64(mad->trans_id), smp->hop_ptr);
> > -                        }
> > -
> >                          if (!(p_req_madw = get_madw(p_vend, &mad->trans_id))) {
> >                                  OSM_LOG(p_vend->p_log, OSM_LOG_ERROR,
> >                                          "ERR 5412: "
> >
> >
> > But I felt that was a bit draconian, and it was not my initial intent.
>
> Yes that's overkill.
> I think it is more like the below:
>
>                 /* if status != 0 and GMP then we are handling recv timeout on send */
>                 if (umad_status(p_madw->vend_wrap.umad)) {
>
>                         if ((mad->mgmt_class != IB_MCLASS_SUBN_DIR) &&
>                             (mad->mgmt_class != IB_MCLASS_SUBN_LID)) {
>                                 /* LID routed */
>                                 OSM_LOG(p_vend->p_log, OSM_LOG_ERROR, "ERR 5410: "
>                                         "Send completed with error -- dropping\n"
>                                         "\t\t\tClass 0x%x, Method 0x%X, Attr 0x%X, "
>                                         "TID 0x%" PRIx64 ", LID %u\n",
>                                         mad->mgmt_class, mad->method,
>                                         cl_ntoh16(mad->attr_id),
>                                         cl_ntoh64(mad->trans_id),
>                                         cl_ntoh16(ib_mad_addr->lid));
>                         }
>
> removing the else clause totally.

New patch which logs this in the SA so we can make the above:

        OSM_LOG(p_vend->p_log, OSM_LOG_VERBOSE, "ERR 5410: "
                "Receive Timeout on Send -- dropping "
                "TID 0x%" PRIx64 "\n", cl_ntoh64(mad->trans_id));

Just for reference of where the callback is coming from, if needed.

Ira

>
> -- Hal
>
> > Ira
> >
> >>
> >> -- Hal
> >>
> >>>                  }
> >>>
> >>>                  if (!(p_req_madw = get_madw(p_vend, &mad->trans_id))) {
> >>> diff --git a/opensm/osm_helper.c b/opensm/osm_helper.c
> >>> index f9f3d9d..b968679 100644
> >>> --- a/opensm/osm_helper.c
> >>> +++ b/opensm/osm_helper.c
> >>> @@ -2059,8 +2059,9 @@ void osm_dump_smp_dr_path(IN osm_log_t * p_log, IN const ib_smp_t * p_smp,
> >>>                  char buf[BUF_SIZE];
> >>>                  unsigned n;
> >>>
> >>> -                n = sprintf(buf, "Received SMP on a %u hop path: "
> >>> -                            "Initial path = ", p_smp->hop_count);
> >>> +                n = sprintf(buf, " DR SMP (TID 0x%" PRIx64 ") on a %u hop path: "
> >>> +                            "Initial path = ",
> >>> +                            cl_ntoh64(p_smp->trans_id), p_smp->hop_count);
> >>>                  n += sprint_uint8_arr(buf + n, sizeof(buf) - n,
> >>>                                        p_smp->initial_path,
> >>>                                        p_smp->hop_count + 1);
> >>> diff --git a/opensm/osm_sm_mad_ctrl.c b/opensm/osm_sm_mad_ctrl.c
> >>> index ee92c66..a3b444a 100644
> >>> --- a/opensm/osm_sm_mad_ctrl.c
> >>> +++ b/opensm/osm_sm_mad_ctrl.c
> >>> @@ -704,6 +704,7 @@ Exit:
> >>>   */
> >>>  static void sm_mad_ctrl_send_err_cb(IN void *context, IN osm_madw_t * p_madw)
> >>>  {
> >>> +        char lidstr[8];
> >>>          osm_sm_mad_ctrl_t *p_ctrl = context;
> >>>          ib_api_status_t status;
> >>>          ib_smp_t *p_smp;
> >>> @@ -713,13 +714,24 @@ static void sm_mad_ctrl_send_err_cb(IN void *context, IN osm_madw_t * p_madw)
> >>>          CL_ASSERT(p_madw);
> >>>
> >>>          p_smp = osm_madw_get_smp_ptr(p_madw);
> >>> +
> >>> +        if (p_smp->mgmt_class == IB_MCLASS_SUBN_DIR)
> >>> +                lidstr[0] = '\0';
> >>> +        else
> >>> +                snprintf(lidstr, 8, " DLID %u",
> >>> +                         cl_ntoh16(p_madw->mad_addr.dest_lid));
> >>> +
> >>>          OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 3113: "
> >>>                  "MAD completed in error (%s): "
> >>> -                "%s(%s), attr_mod 0x%x, TID 0x%" PRIx64 "\n",
> >>> +                "%s(%s), attr_mod 0x%x, TID 0x%" PRIx64 " %s\n",
> >>>                  ib_get_err_str(p_madw->status),
> >>>                  ib_get_sm_method_str(p_smp->method),
> >>>                  ib_get_sm_attr_str(p_smp->attr_id), cl_ntoh32(p_smp->attr_mod),
> >>> -                cl_ntoh64(p_smp->trans_id));
> >>> +                cl_ntoh64(p_smp->trans_id),
> >>> +                lidstr);
> >>> +
> >>> +        if (p_smp->mgmt_class == IB_MCLASS_SUBN_DIR)
> >>> +                osm_dump_smp_dr_path(p_ctrl->p_log, p_smp, OSM_LOG_ERROR);
> >>>
> >>>          /*
> >>>             If this was a SubnSet MAD, then this error might indicate a problem
> >>
> >
>

--
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
[email protected]
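
A side note on the two pieces of logic discussed in this thread: the class
check that decides whether the vendor layer still needs to log, and the
lidstr buffer in the sm_mad_ctrl_send_err_cb hunk. The standalone sketch
below is only an illustration, not OpenSM code. The helper names
is_smp_class() and format_dlid() are hypothetical, the MCLASS_* macros are
local stand-ins for IB_MCLASS_SUBN_LID and IB_MCLASS_SUBN_DIR, and the LID is
assumed to be in host byte order already. It suggests that " DLID %u" needs
up to 12 bytes for a 16-bit LID, so an 8-byte buffer would truncate any LID
of 10 or more.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Management class values as defined by the InfiniBand specification. */
#define MCLASS_SUBN_LID 0x01	/* LID-routed SMP */
#define MCLASS_SUBN_DIR 0x81	/* direct-routed SMP */

/*
 * Hypothetical helper: nonzero when the MAD is an SMP of either routing
 * type. Under the proposal in this thread, SMP send errors are logged by
 * the SM callback, so the vendor layer would log only when this is zero.
 */
static int is_smp_class(uint8_t mgmt_class)
{
	return mgmt_class == MCLASS_SUBN_LID || mgmt_class == MCLASS_SUBN_DIR;
}

/*
 * Hypothetical helper: build the address suffix for the error line.
 * " DLID " is 6 characters and a 16-bit LID prints as up to 5 digits,
 * so buf should hold at least 12 bytes including the terminating NUL.
 * snprintf() truncates silently when len is smaller than that.
 */
static void format_dlid(char *buf, size_t len, uint8_t mgmt_class,
			uint16_t dlid)
{
	assert(buf != NULL && len > 0);

	if (mgmt_class == MCLASS_SUBN_DIR)
		buf[0] = '\0';	/* DR SMPs are identified by path, not LID */
	else
		snprintf(buf, len, " DLID %u", (unsigned)dlid);
}
```

Calling format_dlid() with an 8-byte buffer and LID 12345 leaves " DLID 1" in
the buffer, while a 16-byte buffer holds the full " DLID 12345". If that
matters for the real patch, sizing lidstr at 12 bytes or more and passing
sizeof(lidstr) to snprintf() would be one way to handle it.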
