On 12/15/2011 7:33 PM, Ira Weiny wrote:
> On Thu, 15 Dec 2011 14:20:28 -0800
> Hal Rosenstock <[email protected]> wrote:
>
>> On 12/15/2011 12:49 PM, Ira Weiny wrote:
>>> On Thu, 15 Dec 2011 06:15:17 -0800
>>> Hal Rosenstock <[email protected]> wrote:
>>>
>>>> On 12/14/2011 10:18 PM, Ira Weiny wrote:
>>>>>
>>>>> In addition print transaction ID of all DR PATH dumps to make sure we know
>>>>> which MAD's they refer to.
>>>>
>>>> A note on this approach is that this splits the logging of send errors
>>>> between the vendor layer and SM rather than keeping it all at one layer
>>>> of the implementation. That's the tradeoff to not fixing the bug in
>>>> umad_receiver in terms of printing the DR path in ERR 5411.
>>>
>>> Yes I guess it could be viewed this way but I really thought of it more as
>>> adding to the already existing logging in sm_mad_ctrl_send_err_cb and
>>> fixing a
>>> bug in the logging of umad_receiver.
>>>
>>> As I responded in the other thread I did not take out any logging in
>>> umad_receiver which I think is still valid.
>>
>> You left some redundant logging in though and that was what I was
>> commenting on here.
>>
>>> In addition I just added logging
>>> in the error callback regarding the request which timed out.
>>>
>>>>
>>>>> Signed-off-by: Ira Weiny <[email protected]>
>>>>> ---
>>>>> libvendor/osm_vendor_ibumad.c | 2 --
>>>>> opensm/osm_helper.c | 5 +++--
>>>>> opensm/osm_sm_mad_ctrl.c | 16 ++++++++++++++--
>>>>> 3 files changed, 17 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/libvendor/osm_vendor_ibumad.c b/libvendor/osm_vendor_ibumad.c
>>>>> index e2ebd8e..b2872c8 100644
>>>>> --- a/libvendor/osm_vendor_ibumad.c
>>>>> +++ b/libvendor/osm_vendor_ibumad.c
>>>>> @@ -348,8 +348,6 @@ static void *umad_receiver(void *p_ptr)
>>>>> ", Hop Ptr: 0x%X\n",
>>>>> mad->method, cl_ntoh16(mad->attr_id),
>>>>> cl_ntoh64(mad->trans_id), smp->hop_ptr);
>>>>> - osm_dump_smp_dr_path(p_vend->p_log, smp,
>>>>> - OSM_LOG_ERROR);
>>>>
>>>> If you're going this direction, why not remove the logging of error 5411
>>>> above it which means eliminate the else clause there ? Isn't that
>>>> redundant with your change below to sm_mad_ctrl_send_err_cb ?
>>>
>>> Technically, yes it is redundant as the "response" is not really a response.
>>> (I think.) But my intention was not to remove any logging except that which
>>> was "useless".
>>
>> If this logging is moved up to the callback, then it should be removed
>> from here IMO (and also takes care of the cancelled case too at least
>> for SMPs).
>
> Ok.
>
>>
>>>>
>>>> Also, shouldn't another related change to umad_receiver be done:
>>>>
>>>> Where it is:
>>>> if (mad->mgmt_class != IB_MCLASS_SUBN_DIR) {
>>>> it should now be:
>>>> if ((mad->mgmt_class != IB_MCLASS_SUBN_DIR) &&
>>>> (mad->mgmt_class != IB_MCLASS_SUBN_LID)) {
>>>>
>>>> to go along with SM class being logged in the SM send_err callback
>>>> rather than at umad layer.
>>>
>>> I am not sure I follow here.
>>
>> Since the callback is made for both DR and LR SMPs, the logging at the
>> vendor layer isn't needed for those. It's still needed for GMPs though
>> (like PerfMgr).
>
> The PerfMgr prints the address info in it's error call back.
I see other info there but not LID and TID being logged there.
-- Hal
> The SA however does not. :-(
>
> So I will add it there.
>
>>
>>> Why would you care about the other classes which
>>> timeout? Wouldn't they have the same issue of a response which is "fake"?
>>
>> No; the issue is only with DR path. Isn't LID fine ?
>
> Yep it would be.
>
>>
>>> If we want to remove the logging at this layer I think we should consider
>>> this.
>>>
>>> diff --git a/libvendor/osm_vendor_ibumad.c b/libvendor/osm_vendor_ibumad.c
>>>
>>> index b2872c8..b352cef 100644
>>> --- a/libvendor/osm_vendor_ibumad.c
>>> +++ b/libvendor/osm_vendor_ibumad.c
>>> @@ -327,29 +327,6 @@ static void *umad_receiver(void *p_ptr)
>>> /* if status != 0 then we are handling recv timeout on send
>>> */
>>> if (umad_status(p_madw->vend_wrap.umad)) {
>>>
>>> - if (mad->mgmt_class != IB_MCLASS_SUBN_DIR) {
>>> - /* LID routed */
>>> - OSM_LOG(p_vend->p_log, OSM_LOG_ERROR, "ERR
>>> 5410: "
>>> - "Send completed with error --
>>> dropping\n"
>>> - "\t\t\tClass 0x%x, Method 0x%X,
>>> Attr 0x%X, "
>>> - "TID 0x%" PRIx64 ", LID %u\n",
>>> - mad->mgmt_class, mad->method,
>>> - cl_ntoh16(mad->attr_id),
>>> - cl_ntoh64(mad->trans_id),
>>> - cl_ntoh16(ib_mad_addr->lid));
>>> - } else {
>>> - ib_smp_t *smp;
>>> -
>>> - /* Direct routed SMP */
>>> - smp = (ib_smp_t *) mad;
>>> - OSM_LOG(p_vend->p_log, OSM_LOG_ERROR, "ERR
>>> 5411: "
>>> - "DR SMP Send completed with error
>>> -- dropping\n"
>>> - "\t\t\tMethod 0x%X, Attr 0x%X, TID
>>> 0x%" PRIx64
>>> - ", Hop Ptr: 0x%X\n",
>>> - mad->method,
>>> cl_ntoh16(mad->attr_id),
>>> - cl_ntoh64(mad->trans_id),
>>> smp->hop_ptr);
>>> - }
>>> -
>>> if (!(p_req_madw = get_madw(p_vend,
>>> &mad->trans_id))) {
>>> OSM_LOG(p_vend->p_log, OSM_LOG_ERROR,
>>> "ERR 5412: "
>>>
>>>
>>> But I felt that was a bit draconian, and it was not my initial intent.
>>
>> Yes that's overkill. I think it is more like the below:
>>
>> /* if status != 0 and GMP then we are handling recv
>> timeout on send */
>> if (umad_status(p_madw->vend_wrap.umad)) {
>>
>> if ((mad->mgmt_class != IB_MCLASS_SUBN_DIR) &&
>> (mad->mgmt_class != IB_MCLASS_SUBN_LID)) {
>> /* LID routed */
>> OSM_LOG(p_vend->p_log, OSM_LOG_ERROR,
>> "ERR 5410: "
>> "Send completed with error --
>> dropping\n"
>> "\t\t\tClass 0x%x, Method 0x%X,
>> Attr 0x%X, "
>> "TID 0x%" PRIx64 ", LID %u\n",
>> mad->mgmt_class, mad->method,
>> cl_ntoh16(mad->attr_id),
>> cl_ntoh64(mad->trans_id),
>> cl_ntoh16(ib_mad_addr->lid));
>> }
>>
>> removing the else clause totally.
>
> New patch which logs this in the SA so we can make the above:
>
> OSM_LOG(p_vend->p_log, OSM_LOG_VERBOSE, "ERR 5410: "
> "Recieve Timeout on Send -- dropping "
> "TID 0x%" PRIx64 "\n",
> cl_ntoh64(mad->trans_id));
>
> Just for reference of where the call back is coming from if needed,
> Ira
>
>>
>> -- Hal
>>
>>> Ira
>>>
>>>>
>>>> -- Hal
>>>>
>>>>> }
>>>>>
>>>>> if (!(p_req_madw = get_madw(p_vend, &mad->trans_id))) {
>>>>> diff --git a/opensm/osm_helper.c b/opensm/osm_helper.c
>>>>> index f9f3d9d..b968679 100644
>>>>> --- a/opensm/osm_helper.c
>>>>> +++ b/opensm/osm_helper.c
>>>>> @@ -2059,8 +2059,9 @@ void osm_dump_smp_dr_path(IN osm_log_t * p_log, IN
>>>>> const ib_smp_t * p_smp,
>>>>> char buf[BUF_SIZE];
>>>>> unsigned n;
>>>>>
>>>>> - n = sprintf(buf, "Received SMP on a %u hop path: "
>>>>> - "Initial path = ", p_smp->hop_count);
>>>>> + n = sprintf(buf, " DR SMP (TID 0x%" PRIx64 ") on a %u hop
>>>>> path: "
>>>>> + "Initial path = ",
>>>>> + cl_ntoh64(p_smp->trans_id), p_smp->hop_count);
>>>>> n += sprint_uint8_arr(buf + n, sizeof(buf) - n,
>>>>> p_smp->initial_path,
>>>>> p_smp->hop_count + 1);
>>>>> diff --git a/opensm/osm_sm_mad_ctrl.c b/opensm/osm_sm_mad_ctrl.c
>>>>> index ee92c66..a3b444a 100644
>>>>> --- a/opensm/osm_sm_mad_ctrl.c
>>>>> +++ b/opensm/osm_sm_mad_ctrl.c
>>>>> @@ -704,6 +704,7 @@ Exit:
>>>>> */
>>>>> static void (IN void *context, IN osm_madw_t * p_madw)
>>>>> {
>>>>> + char lidstr[8];
>>>>> osm_sm_mad_ctrl_t *p_ctrl = context;
>>>>> ib_api_status_t status;
>>>>> ib_smp_t *p_smp;
>>>>> @@ -713,13 +714,24 @@ static void sm_mad_ctrl_send_err_cb(IN void
>>>>> *context, IN osm_madw_t * p_madw)
>>>>> CL_ASSERT(p_madw);
>>>>>
>>>>> p_smp = osm_madw_get_smp_ptr(p_madw);
>>>>> +
>>>>> + if (p_smp->mgmt_class == IB_MCLASS_SUBN_DIR)
>>>>> + lidstr[0] = '\0';
>>>>> + else
>>>>> + snprintf(lidstr, 8, " DLID %u",
>>>>> + cl_ntoh16(p_madw->mad_addr.dest_lid));
>>>>> +
>>>>> OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 3113: "
>>>>> "MAD completed in error (%s): "
>>>>> - "%s(%s), attr_mod 0x%x, TID 0x%" PRIx64 "\n",
>>>>> + "%s(%s), attr_mod 0x%x, TID 0x%" PRIx64 " %s\n",
>>>>> ib_get_err_str(p_madw->status),
>>>>> ib_get_sm_method_str(p_smp->method),
>>>>> ib_get_sm_attr_str(p_smp->attr_id), cl_ntoh32(p_smp->attr_mod),
>>>>> - cl_ntoh64(p_smp->trans_id));
>>>>> + cl_ntoh64(p_smp->trans_id),
>>>>> + lidstr);
>>>>> +
>>>>> + if (p_smp->mgmt_class == IB_MCLASS_SUBN_DIR)
>>>>> + osm_dump_smp_dr_path(p_ctrl->p_log, p_smp, OSM_LOG_ERROR);
>>>>>
>>>>> /*
>>>>> If this was a SubnSet MAD, then this error might indicate a problem
>>>>
>>>
>>>
>>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html