Eric Cheng wrote:
> On Tue, Jul 17, 2007 at 10:34:26AM +0800, Cathy Zhou wrote:
>>> the problem with this is you could always end up with dld_max_q_count
>>> (currently 16M) messages queued up. if messages are large, you may tie
>>> up lots of memory. even though TCP does flow control, there could be
>>> multiple connections sending through the same nic, so 16M messages could
>>> get queued fairly quickly. for UDP, this could happen with just one sender.
>>>
>> For UDP, it is not a problem, because UDP checks canputnext() before it
>> calls putnext().
>>
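>> (i.e. the usual pattern -- a minimal sketch of a STREAMS write-side put
>> routine, not the actual UDP code:)
>>
>> #include <sys/types.h>
>> #include <sys/stream.h>
>>
>> static int
>> xx_wput_data(queue_t *q, mblk_t *mp)	/* hypothetical module */
>> {
>>     if (!canputnext(q)) {
>>         /* downstream is flow-controlled; queue locally for wsrv */
>>         (void) putq(q, mp);
>>         return (0);
>>     }
>>
>>     putnext(q, mp);
>>     return (0);
>> }
>>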
>> I understand why the message count check is there today. But I see that
>> most of the legacy drivers do not have the same kind of check. They simply
>> rely on STREAMS flow control, which in the case of TCP is not used. So a
>> legacy driver (like ce) simply does putq() to queue the message, which
>> would also tie up memory.
>>
>> Another option is for dld to free messages sent from the upper layer when
>> dld is already queue-busy; if the upper stream strictly followed the
>> STREAMS flow-control requirement, it should not be sending messages down
>> at that point anyway. But I don't know whether that would cause a
>> performance problem for TCP. I could test it.
>>
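>> Roughly what I have in mind, somewhere in the tx path before the message
>> would otherwise be enqueued (an untested sketch, nothing more):
>>
>>     mutex_enter(&dsp->ds_tx_list_lock);
>>     if (dsp->ds_tx_qbusy) {
>>         /*
>>          * The upper layer ignored flow control; drop the message
>>          * instead of enqueueing it.
>>          */
>>         mutex_exit(&dsp->ds_tx_list_lock);
>>         freemsg(mp);
>>         return;
>>     }
>>     mutex_exit(&dsp->ds_tx_list_lock);
>>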
>
> it's better not to free up messages if dld is qbusy because this might
> cause tcp to rexmit and halve its send window, etc. if we have some buffering
> in dld like we do today, by the time the buffer fills up, tx resources may be
> available again in the nic. this avoids having to drop any messages.
>
I ran a MAXQ TCP throughput test over ce today, using a dld driver that
frees messages when dsp->ds_tx_qbusy is TRUE. I tried 5000 sessions and
1000 sessions, and the performance in both cases seems to be improved.
But it is possible that:
1. the performance data is not accurate (I only tested it once);
2. the performance improves only for a certain hardware/traffic-pattern
combination.
I might need to spend more time verifying this. Do you have any particular
performance test (over a certain driver) in mind that I should run?
> you mentioned that the chain walking is taking a lot of time. how about if
> you calculate the mp chain size only if dld_tx_enqueue() is called by
> dld_tx_single() but not dld_wsrv()? the calculation won't take long
> because the upper layers do not pass down chains. this will make ds_tx_cnt
> not accurate but it's better than leaving the queue unbounded.
>
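If I follow, that would look roughly like the sketch below (the count_size
argument is just a name I made up for illustration, not existing code):

static void
dld_tx_enqueue(dld_str_t *dsp, mblk_t *mp, boolean_t head_insert,
    boolean_t count_size)
{
    size_t	cnt;
    uint_t	msgcnt;
    mblk_t	*tail;

    /*
     * count_size is a hypothetical flag: dld_tx_single() would pass
     * B_TRUE (mp is a single message, so the walk is cheap), while
     * dld_wsrv() would pass B_FALSE and skip the get_mpsize() calls,
     * at the cost of an inaccurate ds_tx_cnt.
     */
    for (tail = mp, msgcnt = 1, cnt = count_size ? get_mpsize(mp) : 0;
        tail->b_next != NULL; tail = tail->b_next) {
        if (count_size)
            cnt += get_mpsize(tail->b_next);
        msgcnt++;
    }
    ...
}
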
But the messages pile up more quickly when dld_tx_enqueue() is called from
dld_tx_single(). In that case it queues a *chain* of messages, while in the
other cases only a single message is queued.
Thanks
- Cathy
> eric
>
>> Thanks
>> - Cathy
>>
>>> eric
>>>
>>>
>>>> - Cathy
>>>>
>>>>> Hi,
>>>>>
>>>>> I had another look at the er_kernel data (which seems to have some problem
>>>>> showing the disassembly code; I will ask Roch about that), and below is
>>>>> what I found. I hope I can get some of your suggestions:
>>>>>
>>>>> 1. The first problem I noticed is what I mentioned in another email:
>>>>> a lot of CPU cycles are spent in the dld_tx_enqueue()->qenable() path,
>>>>> which in turn drives the dld_wsrv()->dld_tx_enqueue() code path. That
>>>>> also appears to make dsp->ds_tx_list_lock a hot lock, since it is held
>>>>> by both the dld_tx_enqueue() and dld_wsrv() functions.
>>>>>
>>>>> But as Eric pointed out, we cannot remove the qenable() call from the
>>>>> dld_tx_enqueue() function. I don't see anything we can do about the
>>>>> above issue.
>>>>>
>>>>> 2. The data also shows that the get_mpsize() function takes the second-most
>>>>> exclusive KCPU cycles. The get_mpsize() function was added by UV to compute
>>>>> the size of the messages in a chain, so that the dld_tx_enqueue() function
>>>>> can determine whether or not to free the messages instead of enqueuing them:
>>>>>
>>>>> static void
>>>>> dld_tx_enqueue(dld_str_t *dsp, mblk_t *mp, boolean_t head_insert)
>>>>> {
>>>>>     ...
>>>>>     /* Calculate total size and count of the packet(s) */
>>>>>     for (tail = mp, cnt = get_mpsize(mp), msgcnt = 1;
>>>>>         tail->b_next != NULL; tail = tail->b_next) {
>>>>>         cnt += get_mpsize(tail->b_next);
>>>>>         msgcnt++;
>>>>>     }
>>>>>
>>>>>     mutex_enter(&dsp->ds_tx_list_lock);
>>>>>     /*
>>>>>      * If the queue depth would exceed the allowed threshold, drop
>>>>>      * new packet(s) and drain those already in the queue.
>>>>>      */
>>>>>     tot_cnt = dsp->ds_tx_cnt + cnt;
>>>>>     tot_msgcnt = dsp->ds_tx_msgcnt + msgcnt;
>>>>>
>>>>>     if (!head_insert &&
>>>>>         (tot_cnt >= dld_max_q_count || tot_msgcnt >= dld_max_q_count)) {
>>>>>         ASSERT(dsp->ds_tx_qbusy);
>>>>>         mutex_exit(&dsp->ds_tx_list_lock);
>>>>>         freemsgchain(mp);
>>>>>         goto done;
>>>>>     }
>>>>>
>>>>>     /* Update the queue size parameters */
>>>>>     dsp->ds_tx_cnt = tot_cnt;
>>>>>     dsp->ds_tx_msgcnt = tot_msgcnt;
>>>>>     ...
>>>>> }
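>>>>>
>>>>> (get_mpsize() itself is not shown above; as far as I remember it just
>>>>> walks the b_cont chain and adds up the data bytes, roughly:)
>>>>>
>>>>> static size_t
>>>>> get_mpsize(mblk_t *mp)
>>>>> {
>>>>>     size_t size = 0;
>>>>>
>>>>>     /* sum the data bytes of every mblk in this message */
>>>>>     for (; mp != NULL; mp = mp->b_cont)
>>>>>         size += MBLKL(mp);	/* MBLKL() is from <sys/strsun.h> */
>>>>>
>>>>>     return (size);
>>>>> }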
>>>>>
>>>>>
>>>>> My question is whether the above message-counting check is needed.
>>>>> Usually, the stream layer above dld should call canputnext() before it
>>>>> calls the dld tx function, and if dld is in a state where messages need
>>>>> to be enqueued, canputnext() should return FALSE. This particular test
>>>>> is a TCP TX test, and because TCP does not call canputnext() to check
>>>>> whether dld can handle more tx packets, lots of CPU cycles end up being
>>>>> spent in the dld_tx_enqueue() and get_mpsize() functions. But since TCP
>>>>> has its own flow control, I don't think there will be enough messages
>>>>> flooding into the dld queues to use up the memory.
>>>>>
>>>>> Thanks
>>>>> - Cathy
>>>>>
>>>>>