Eric Cheng wrote:
> On Tue, Jul 17, 2007 at 10:34:26AM +0800, Cathy Zhou wrote:
>>> the problem with this is you could always end up with dld_max_q_count
>>> (currently 16M) messages queued up. if messages are large, you may tie
>>> up lots of memory. even though TCP does flow control, there could be
>>> multiple connections sending through the same nic, so 16M messages could
>>> get queued fairly quickly. for UDP, this could happen with just one sender.
>>>
>> For UDP, it is not a problem, because UDP checks canputnext() before it
>> calls putnext().
>>
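>> (i.e. the usual pattern -- a minimal sketch of a STREAMS write-side put
>> routine, not the actual UDP code:)
>>
>> #include <sys/types.h>
>> #include <sys/stream.h>
>>
>> static int
>> xx_wput_data(queue_t *q, mblk_t *mp)	/* hypothetical module */
>> {
>>     if (!canputnext(q)) {
>>         /* downstream is flow-controlled; queue locally for wsrv */
>>         (void) putq(q, mp);
>>         return (0);
>>     }
>>
>>     putnext(q, mp);
>>     return (0);
>> }
>>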
>> I understand why the message count check is there today. But I see that
>> most of the legacy drivers do not have the same kind of check. They simply
>> rely on STREAMS flow control, which in the case of TCP is not used. So a
>> legacy driver (like ce) simply does putq() to queue the message, which
>> would also tie up memory.
>>
>> Another option is for dld to free messages sent from the upper layer when
>> dld is already queue-busy; if the upper stream strictly followed the
>> STREAMS flow-control requirement, it should not be sending messages down
>> at that point anyway. But I don't know whether that would cause a
>> performance problem for TCP. I could test it.
>>
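>> Roughly what I have in mind, somewhere in the tx path before the message
>> would otherwise be enqueued (an untested sketch, nothing more):
>>
>>     mutex_enter(&dsp->ds_tx_list_lock);
>>     if (dsp->ds_tx_qbusy) {
>>         /*
>>          * The upper layer ignored flow control; drop the message
>>          * instead of enqueueing it.
>>          */
>>         mutex_exit(&dsp->ds_tx_list_lock);
>>         freemsg(mp);
>>         return;
>>     }
>>     mutex_exit(&dsp->ds_tx_list_lock);
>>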
>
> it's better not to free up messages if dld is qbusy because this might
> cause tcp to rexmit and halve its send window, etc. if we have some buffering
> in dld like we do today, by the time the buffer fills up, tx resources may be
> available again in the nic. this avoids having to drop any messages.
>
I ran a MAXQ TCP throughput test over ce today, using a dld driver that
frees messages when dsp->ds_tx_qbusy is TRUE. I tried 5000 sessions and
1000 sessions, and the performance in both cases seems to be improved.
But it is possible that:
1. the performance data is not accurate (I only tested it once);
2. the performance improves only for a certain hardware/traffic-pattern
combination.
I might need to spend more time verifying this. Do you have any particular
performance test (over a certain driver) in mind that I should run?
> you mentioned that the chain walking is taking a lot of time. how about if
> you calculate the mp chain size only if dld_tx_enqueue() is called by
> dld_tx_single() but not dld_wsrv()? the calculation won't take long
> because the upper layers do not pass down chains. this will make ds_tx_cnt
> not accurate but it's better than leaving the queue unbounded.
>
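If I follow, that would look roughly like the sketch below (the count_size
argument is just a name I made up for illustration, not existing code):

static void
dld_tx_enqueue(dld_str_t *dsp, mblk_t *mp, boolean_t head_insert,
    boolean_t count_size)
{
    size_t	cnt;
    uint_t	msgcnt;
    mblk_t	*tail;

    /*
     * count_size is a hypothetical flag: dld_tx_single() would pass
     * B_TRUE (mp is a single message, so the walk is cheap), while
     * dld_wsrv() would pass B_FALSE and skip the get_mpsize() calls,
     * at the cost of an inaccurate ds_tx_cnt.
     */
    for (tail = mp, msgcnt = 1, cnt = count_size ? get_mpsize(mp) : 0;
        tail->b_next != NULL; tail = tail->b_next) {
        if (count_size)
            cnt += get_mpsize(tail->b_next);
        msgcnt++;
    }
    ...
}
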
But the messages pile up more quickly when dld_tx_enqueue() is called from
dld_tx_single(). In that case it queues a *chain* of messages, while in the
other cases only a single message is queued.
Thanks
- Cathy
> eric
>
>> Thanks
>> - Cathy
>>
>>> eric
>>>
>>>
>>>> - Cathy
>>>>
>>>>> Hi,
>>>>>
>>>>> I had another look at the er_kernel data (which seems to have some problem
>>>>> showing the disassembly code; I will ask Roch about that), and below is
>>>>> what I found. I hope I can get some of your suggestions:
>>>>>
>>>>> 1. The first problem I noticed is what I mentioned in another email:
>>>>> a lot of CPU cycles are spent in the dld_tx_enqueue()->qenable() path,
>>>>> which in turn drives the dld_wsrv()->dld_tx_enqueue() code path. That
>>>>> also appears to make dsp->ds_tx_list_lock a hot lock, since it is held
>>>>> by both the dld_tx_enqueue() and dld_wsrv() functions.
>>>>>
>>>>> But as Eric pointed out, we cannot remove the qenable() call from the
>>>>> dld_tx_enqueue() function. I don't see anything we can do about the
>>>>> above issue.
>>>>>
>>>>> 2. The data also shows that the get_mpsize() function takes the second-most
>>>>> exclusive KCPU cycles. The get_mpsize() function was added by UV to compute
>>>>> the size of the messages in a chain, so that the dld_tx_enqueue() function
>>>>> can determine whether or not to free the messages instead of enqueuing them:
>>>>>
>>>>> static void
>>>>> dld_tx_enqueue(dld_str_t *dsp, mblk_t *mp, boolean_t head_insert)
>>>>> {
>>>>>     ...
>>>>>     /* Calculate total size and count of the packet(s) */
>>>>>     for (tail = mp, cnt = get_mpsize(mp), msgcnt = 1;
>>>>>         tail->b_next != NULL; tail = tail->b_next) {
>>>>>         cnt += get_mpsize(tail->b_next);
>>>>>         msgcnt++;
>>>>>     }
>>>>>
>>>>>     mutex_enter(&dsp->ds_tx_list_lock);
>>>>>     /*
>>>>>      * If the queue depth would exceed the allowed threshold, drop
>>>>>      * new packet(s) and drain those already in the queue.
>>>>>      */
>>>>>     tot_cnt = dsp->ds_tx_cnt + cnt;
>>>>>     tot_msgcnt = dsp->ds_tx_msgcnt + msgcnt;
>>>>>
>>>>>     if (!head_insert &&
>>>>>         (tot_cnt >= dld_max_q_count || tot_msgcnt >= dld_max_q_count)) {
>>>>>         ASSERT(dsp->ds_tx_qbusy);
>>>>>         mutex_exit(&dsp->ds_tx_list_lock);
>>>>>         freemsgchain(mp);
>>>>>         goto done;
>>>>>     }
>>>>>
>>>>>     /* Update the queue size parameters */
>>>>>     dsp->ds_tx_cnt = tot_cnt;
>>>>>     dsp->ds_tx_msgcnt = tot_msgcnt;
>>>>>     ...
>>>>> }
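>>>>>
>>>>> (get_mpsize() itself is not shown above; as far as I remember it just
>>>>> walks the b_cont chain and adds up the data bytes, roughly:)
>>>>>
>>>>> static size_t
>>>>> get_mpsize(mblk_t *mp)
>>>>> {
>>>>>     size_t size = 0;
>>>>>
>>>>>     /* sum the data bytes of every mblk in this message */
>>>>>     for (; mp != NULL; mp = mp->b_cont)
>>>>>         size += MBLKL(mp);	/* MBLKL() is from <sys/strsun.h> */
>>>>>
>>>>>     return (size);
>>>>> }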
>>>>>
>>>>>
>>>>> My question is whether the above message-counting check is needed.
>>>>> Usually, the stream layer above dld should call canputnext() before it
>>>>> calls the dld tx function, and if dld is in a state where messages need
>>>>> to be enqueued, canputnext() should return FALSE. This particular test
>>>>> is a TCP TX test, and because TCP does not call canputnext() to check
>>>>> whether dld can handle more tx packets, lots of CPU cycles end up being
>>>>> spent in the dld_tx_enqueue() and get_mpsize() functions. But since TCP
>>>>> has its own flow control, I don't think there will be enough messages
>>>>> flooding into the dld queues to use up the memory.
>>>>>
>>>>> Thanks
>>>>> - Cathy
>>>>>
>>>>>