On 12/15/2011 12:40 PM, Ira Weiny wrote:
> On Thu, 15 Dec 2011 06:15:09 -0800
> Hal Rosenstock <[email protected]> wrote:
>
>> On 12/14/2011 10:18 PM, Ira Weiny wrote:
>>> On Wed, 14 Dec 2011 18:21:28 -0800
>>> Hal Rosenstock <[email protected]> wrote:
>>>
>>>> On 12/14/2011 12:09 AM, Ira Weiny wrote:
>>>>>
>>>>> In addition print transaction ID of all DR PATH dumps to make sure we know
>>>>> which MAD's they refer to.
>>>>>
>>>>> Signed-off-by: Ira Weiny <[email protected]>
>>>>> ---
>>>>> opensm/osm_helper.c | 5 +++--
>>>>> opensm/osm_sm_mad_ctrl.c | 7 +++++++
>>>>> 2 files changed, 10 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/opensm/osm_helper.c b/opensm/osm_helper.c
>>>>> index f9f3d9d..b6591c4 100644
>>>>> --- a/opensm/osm_helper.c
>>>>> +++ b/opensm/osm_helper.c
>>>>> @@ -2059,8 +2059,9 @@ void osm_dump_smp_dr_path(IN osm_log_t * p_log, IN
>>>>> const ib_smp_t * p_smp,
>>>>> char buf[BUF_SIZE];
>>>>> unsigned n;
>>>>>
>>>>> - n = sprintf(buf, "Received SMP on a %u hop path: "
>>>>> - "Initial path = ", p_smp->hop_count);
>>>>> + n = sprintf(buf, "Received SMP (TID 0x%" PRIx64 ") on a %u hop
>>>>> path: "
>>>>> + "Initial path = ",
>>>>> + cl_ntoh64(p_smp->trans_id), p_smp->hop_count);
>>>>> n += sprint_uint8_arr(buf + n, sizeof(buf) - n,
>>>>> p_smp->initial_path,
>>>>> p_smp->hop_count + 1);
>>>>> diff --git a/opensm/osm_sm_mad_ctrl.c b/opensm/osm_sm_mad_ctrl.c
>>>>> index ee92c66..6abf8b8 100644
>>>>> --- a/opensm/osm_sm_mad_ctrl.c
>>>>> +++ b/opensm/osm_sm_mad_ctrl.c
>>>>> @@ -721,6 +721,13 @@ static void sm_mad_ctrl_send_err_cb(IN void
>>>>> *context, IN osm_madw_t * p_madw)
>>>>> ib_get_sm_attr_str(p_smp->attr_id), cl_ntoh32(p_smp->attr_mod),
>>>>> cl_ntoh64(p_smp->trans_id));
>>>>>
>>>>> + if (p_smp->mgmt_class == IB_MCLASS_SUBN_DIR) {
>>>>> + osm_dump_smp_dr_path(p_ctrl->p_log, p_smp, OSM_LOG_ERROR);
>>>>
>>>> Rather than here, should this be in osm_vendor_ibumad.c ? There's
>>>> already one similar log there but looks like evicted entry logging was
>>>> not done. If not, then do any logs there need to be removed as redundant ?
>>>
>>> Yes looking a bit closer I see that is redundant with the current
>>> umad_status
>>> implementation.
>>
>> Well, not quite. That logs on any send error and osm_dump_smp_dr_path is
>> only currently called for NodeInfo. That's one reason why it's down at
>> that layer right now. I can see your v2 patch addresses this.
>>
>>> IE the message you get is:
>>>
>>> Dec 14 18:31:54 137584 [AEB0C700] 0x01 -> Received SMP on a 4 hop path:
>>> Initial path = 0,0,0,0,0, Return path = 0,0,0,0,0
>>>
>>> That is useless. I can alter the patch to remove that as well.
>>
>> Another alternative is that if it's a bug, why not just fix that ?
>
> I was not sure if it was a bug (but I did not dig through the umad and kernel
> layers to verify that). It looks to me like umad_status only returns TIMEOUT
> as an error. Furthermore, I think the "response" MAD returned (and printed in
> the above logs) does not include a DR path since it is not really a response
> but just a place holder.
I thought that the sent MAD is returned in the timeout case. That would
include the DR paths in the SMP case. It seems to work only some of the
time though and I'm not sure why that is.
> Does the kernel/umad layer copy the request MAD into the "response" MAD on
> timeout?[*]
See above and below under [*]
> (It looks to me like only some of the data is copied into the
> "response" MAD; method, attr_id, and transaction ID).
That's certainly the observation in some but not all cases...
> If the entire request
> is to be copied then the above might be bug.
I thought the request MAD is supposed to be returned on a send error
(including timeout). Maybe some shortcut/optimization was taken here
in some cases.
> If not then the print statement
> I removed is useless but the other one at least confirms the "fake" MAD is
> equivalent to the one printed in the error call back.
Yes, but why have the extra log message just for that ?
> If you want to keep the error printing at the vendor layer then I guess the
> appropriate place to print the error would be on the request mad like this...
>
> } else {
> p_req_madw->status = IB_TIMEOUT;
> /* cb frees req_madw */
> pthread_mutex_lock(&p_vend->cb_mutex);
> pthread_cleanup_push(unlock_mutex,
> &p_vend->cb_mutex);
> // dump p_req_madw here before callback
> (*p_bind->send_err_callback) (p_bind->
> client_context,
> p_req_madw);
> pthread_cleanup_pop(1);
> }
There's also the cancelled case too that calls send_err_callback.
> Or you are correct we could just remove all the logging in the vendor layer
> here.
>
> Alex, any opinions?
>
> Ira
>
> [*] I don't think the request MAD should be copied except for the TID as is
> documented so you can find the request MAD.
What documentation are you referring to ?
-- Hal
>>
>>>>
>>>> Also, does this log every timeout (at error level) ? If so, that might
>>>> not be a good thing in all subnets as timeouts are common.
>>>
>>> Why would you say that? I think it is very valid to know what nodes are
>>> timeing out. When would you not want to know the destination of what is
>>> timing out?
>>
>> Yes, but I'm concerned about spamming the log. Timeouts are "normal" in
>> many subnets; maybe not yours.
>>
>> I was just saying level might be reduced but I can see 5411 is error
>> level too.
>>
>>>>
>>>>> + } else {
>>>>> + OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "LID %u\n",
>>>>> + cl_ntoh16(p_madw->mad_addr.dest_lid));
>>>>
>>>> Log the TID here too ?
>>>
>>> Actually I think moving that into the error print above is better.
>>
>> Sure; that's another way of accomplishing the same thing.
>>
>> -- Hal
>>
>>> Sending V2 now,
>>> Ira
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to [email protected]
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html