Hi,

I had another look at the er_kernel data (which seems to have some problem 
showing the disassembly code; I will ask Roch about that). Below is what I 
found; I hope I can get some of your suggestions:

1. The first problem I noticed is the one I mentioned in another email: a 
lot of CPU cycles are spent in the dld_tx_enqueue()->qenable() path, which 
in turn triggers the dld_wsrv()->dld_tx_enqueue() code path. That also seems 
to be what makes dsp->ds_tx_list_lock a hot lock, since it is held by both 
dld_tx_enqueue() and dld_wsrv().

But as Eric points out, we cannot remove the qenable() call from 
dld_tx_enqueue(), so I don't see anything we can do about this issue.
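
To make the cycle concrete, here is a rough skeleton of the path as I read 
it from the profile. This is only a sketch: the real dld_wsrv() is more 
involved, and driver_tx() below is a made-up stand-in for the MAC transmit 
call.

int
dld_wsrv(queue_t *wq)
{
        dld_str_t *dsp = wq->q_ptr;
        mblk_t *mp;

        /* Detach the whole pending chain under the hot lock */
        mutex_enter(&dsp->ds_tx_list_lock);
        mp = dsp->ds_tx_list_head;
        dsp->ds_tx_list_head = dsp->ds_tx_list_tail = NULL;
        dsp->ds_tx_cnt = dsp->ds_tx_msgcnt = 0;
        mutex_exit(&dsp->ds_tx_list_lock);

        /*
         * driver_tx() stands in for the real transmit path; assume it
         * returns whatever part of the chain it could not send.
         */
        if (mp != NULL && (mp = driver_tx(dsp, mp)) != NULL) {
                /*
                 * Transmit side is still backed up: head-insert the
                 * remainder, which recounts the sizes, retakes
                 * ds_tx_list_lock and calls qenable() on this queue
                 * again, which is the loop the profile shows.
                 */
                dld_tx_enqueue(dsp, mp, B_TRUE);
        }
        return (0);
}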

2. The data also shows that get_mpsize() takes the second-most exclusive 
KCPU cycles. get_mpsize() was added by UV to count the sizes of the messages 
in the chain, so that dld_tx_enqueue() can decide whether to drop (free) the 
messages instead of enqueuing them:

static void
dld_tx_enqueue(dld_str_t *dsp, mblk_t *mp, boolean_t head_insert)
{
        ...
        /* Calculate total size and count of the packet(s) */
        for (tail = mp, cnt = get_mpsize(mp), msgcnt = 1;
            tail->b_next != NULL; tail = tail->b_next) {
                cnt += get_mpsize(tail->b_next);
                msgcnt++;
        }

        mutex_enter(&dsp->ds_tx_list_lock);
        /*
         * If the queue depth would exceed the allowed threshold, drop
         * new packet(s) and drain those already in the queue.
         */
        tot_cnt = dsp->ds_tx_cnt + cnt;
        tot_msgcnt = dsp->ds_tx_msgcnt + msgcnt;

        if (!head_insert &&
            (tot_cnt >= dld_max_q_count || tot_msgcnt >= dld_max_q_count)) {
                ASSERT(dsp->ds_tx_qbusy);
                mutex_exit(&dsp->ds_tx_list_lock);
                freemsgchain(mp);
                goto done;
        }

        /* Update the queue size parameters */
        dsp->ds_tx_cnt = tot_cnt;
        dsp->ds_tx_msgcnt = tot_msgcnt;
        ...
}
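
For reference, here is my guess at what get_mpsize() boils down to. I am 
assuming it is essentially msgdsize(9F), i.e. it walks the b_cont chain of 
one message and sums the data bytes (I haven't checked the UV source, so 
this sketch may differ in detail):

static size_t
get_mpsize(mblk_t *mp)
{
        size_t sz = 0;

        /* Sum the data bytes in every block of this message */
        for (; mp != NULL; mp = mp->b_cont)
                sz += MBLKL(mp);        /* mp->b_wptr - mp->b_rptr */

        return (sz);
}

If that is right, every enqueue walks every block of every packet in the 
chain, which would explain the exclusive cycles.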


My question is whether the above message-counting check is needed at all. 
Usually, the STREAMS layer above dld should call canputnext() before it 
calls the dld tx function, and if dld is in a state where messages would 
need to be enqueued, canputnext() should return FALSE. This particular run 
is the TCP TX test, and because TCP does not call canputnext() to check 
whether dld can handle more tx packets, a lot of CPU cycles end up being 
spent in dld_tx_enqueue() and get_mpsize(). But since TCP has its own flow 
control, I don't think enough messages can flood into the dld queues to use 
up the memory.
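
For comparison, this is the canputnext() pattern I have in mind for a 
module above dld. The sketch is illustrative (xx_wsrv() is a made-up 
service routine), but canputnext(), getq(), putbq() and putnext() are the 
standard STREAMS interfaces:

static int
xx_wsrv(queue_t *q)
{
        mblk_t *mp;

        while ((mp = getq(q)) != NULL) {
                if (!canputnext(q)) {
                        /*
                         * Downstream (dld) is flow-controlled: put the
                         * message back and stop.  We are back-enabled
                         * when the downstream queue drains, so dld
                         * never has to count and drop on our behalf.
                         */
                        (void) putbq(q, mp);
                        break;
                }
                putnext(q, mp);
        }
        return (0);
}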

Thanks
- Cathy

