On Thu, 15 Dec 2011 06:15:09 -0800
Hal Rosenstock <[email protected]> wrote:

> On 12/14/2011 10:18 PM, Ira Weiny wrote:
> > On Wed, 14 Dec 2011 18:21:28 -0800
> > Hal Rosenstock <[email protected]> wrote:
> > 
> >> On 12/14/2011 12:09 AM, Ira Weiny wrote:
> >>>
> >>> In addition print transaction ID of all DR PATH dumps to make sure we know
> >>> which MAD's they refer to.
> >>>
> >>> Signed-off-by: Ira Weiny <[email protected]>
> >>> ---
> >>>  opensm/osm_helper.c      |    5 +++--
> >>>  opensm/osm_sm_mad_ctrl.c |    7 +++++++
> >>>  2 files changed, 10 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/opensm/osm_helper.c b/opensm/osm_helper.c
> >>> index f9f3d9d..b6591c4 100644
> >>> --- a/opensm/osm_helper.c
> >>> +++ b/opensm/osm_helper.c
> >>> @@ -2059,8 +2059,9 @@ void osm_dump_smp_dr_path(IN osm_log_t * p_log, IN 
> >>> const ib_smp_t * p_smp,
> >>>           char buf[BUF_SIZE];
> >>>           unsigned n;
> >>>  
> >>> -         n = sprintf(buf, "Received SMP on a %u hop path: "
> >>> -                     "Initial path = ", p_smp->hop_count);
> >>> +         n = sprintf(buf, "Received SMP (TID 0x%" PRIx64 ") on a %u hop 
> >>> path: "
> >>> +                     "Initial path = ",
> >>> +                     cl_ntoh64(p_smp->trans_id), p_smp->hop_count);
> >>>           n += sprint_uint8_arr(buf + n, sizeof(buf) - n,
> >>>                                 p_smp->initial_path,
> >>>                                 p_smp->hop_count + 1);
> >>> diff --git a/opensm/osm_sm_mad_ctrl.c b/opensm/osm_sm_mad_ctrl.c
> >>> index ee92c66..6abf8b8 100644
> >>> --- a/opensm/osm_sm_mad_ctrl.c
> >>> +++ b/opensm/osm_sm_mad_ctrl.c
> >>> @@ -721,6 +721,13 @@ static void sm_mad_ctrl_send_err_cb(IN void 
> >>> *context, IN osm_madw_t * p_madw)
> >>>           ib_get_sm_attr_str(p_smp->attr_id), cl_ntoh32(p_smp->attr_mod),
> >>>           cl_ntoh64(p_smp->trans_id));
> >>>  
> >>> + if (p_smp->mgmt_class == IB_MCLASS_SUBN_DIR) {
> >>> +         osm_dump_smp_dr_path(p_ctrl->p_log, p_smp, OSM_LOG_ERROR);
> >>
> >> Rather than here, should this be in osm_vendor_ibumad.c ? There's
> >> already one similar log there but looks like evicted entry logging was
> >> not done. If not, then do any logs there need to be removed as redundant ?
> > 
> > Yes looking a bit closer I see that is redundant with the current 
> > umad_status
> > implementation.  
> 
> Well, not quite. That logs on any send error and osm_dump_smp_dr_path is
> only currently called for NodeInfo. That's one reason why it's down at
> that layer right now. I can see your v2 patch addresses this.
> 
> > IE the message you get is:
> > 
> >     Dec 14 18:31:54 137584 [AEB0C700] 0x01 -> Received SMP on a 4 hop path: 
> > Initial path = 0,0,0,0,0, Return path  = 0,0,0,0,0
> > 
> > That is useless.  I can alter the patch to remove that as well.
> 
> Another alternative is that if it's a bug, why not just fix that ?

I was not sure if it was a bug (but I did not dig through the umad and kernel
layers to verify that).  It looks to me like umad_status only returns TIMEOUT
as an error.  Furthermore, I think the "response" MAD returned (and printed in
the above logs) does not include a DR path since it is not really a response
but just a place holder.

Does the kernel/umad layer copy the request MAD into the "response" MAD on
timeout?[*]  (It looks to me like only some of the data is copied into the
"response" MAD; method, attr_id, and transaction ID).  If the entire request
is to be copied then the above might be bug.  If not then the print statement
I removed is useless but the other one at least confirms the "fake" MAD is
equivalent to the one printed in the error call back.

If you want to keep the error printing at the vendor layer then I guess the
appropriate place to print the error would be on the request mad like this...

                        } else {
                                p_req_madw->status = IB_TIMEOUT;
                                /* cb frees req_madw */
                                pthread_mutex_lock(&p_vend->cb_mutex);
                                pthread_cleanup_push(unlock_mutex,
                                                     &p_vend->cb_mutex);
// dump p_req_madw here before callback
                                (*p_bind->send_err_callback) (p_bind->
                                                              client_context,
                                                              p_req_madw);
                                pthread_cleanup_pop(1);
                        }

Or you are correct we could just remove all the logging in the vendor layer
here.  Alex, any opinions?

Ira

[*] I don't think the request MAD should be copied except for the TID as is
documented so you can find the request MAD.

> 
> >>
> >> Also, does this log every timeout (at error level) ? If so, that might
> >> not be a good thing in all subnets as timeouts are common.
> > 
> > Why would you say that?  I think it is very valid to know what nodes are
> > timeing out.  When would you not want to know the destination of what is
> > timing out?
> 
> Yes, but I'm concerned about spamming the log. Timeouts are "normal" in
> many subnets; maybe not yours.
> 
> I was just saying level might be reduced but I can see 5411 is error
> level too.
> 
> >>
> >>> + } else {
> >>> +         OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "LID %u\n",
> >>> +                 cl_ntoh16(p_madw->mad_addr.dest_lid));
> >>
> >> Log the TID here too ?
> > 
> > Actually I think moving that into the error print above is better.
> 
> Sure; that's another way of accomplishing the same thing.
> 
> -- Hal
> 
> > Sending V2 now,
> > Ira
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> > the body of a message to [email protected]
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
[email protected]
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to