On Tue, Jul 17, 2007 at 10:34:26AM +0800, Cathy Zhou wrote:
> >the problem with this is you could always end up with dld_max_q_count
> >(currently 16M) messages queued up. if messages are large, you may tie
> >up lots of memory. even though TCP does flow control, there could be
> >multiple connections sending through the same nic, so 16M messages could
> >get queued fairly quickly. for UDP, this could happen with just one sender.
> >
> For UDP, it is not a problem, because UDP checks canputnext() before it
> calls putnext().
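
(for reference, the usual STREAMS gate looks roughly like this; just a
sketch, not the actual udp code:)

/*
 * sketch: honor downstream flow control before passing a message on.
 */
static void
xx_send_downstream(queue_t *q, mblk_t *mp)
{
        if (canputnext(q)) {
                putnext(q, mp);         /* downstream can take it */
        } else {
                (void) putq(q, mp);     /* queue here; wsrv() retries later */
        }
}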
>
> I understand why the message count check is there today. But I see that
> most of the legacy drivers do not have the same kind of check. They simply
> rely on STREAMS flow control, which in the case of TCP is not used. So a
> legacy driver (like ce) just does putq() to queue the message, which would
> also tie up memory.
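
(i.e. roughly this pattern in the legacy driver; a sketch, not the actual
ce code, and xx_tx_start() is just a stand-in for the driver's tx routine:)

static boolean_t xx_tx_start(queue_t *, mblk_t *);      /* hypothetical */

static int
xx_wput(queue_t *q, mblk_t *mp)
{
        /* no queue-depth check: just queue and rely on STREAMS flow control */
        (void) putq(q, mp);
        return (0);
}

static int
xx_wsrv(queue_t *q)
{
        mblk_t  *mp;

        while ((mp = getq(q)) != NULL) {
                if (!xx_tx_start(q, mp)) {
                        (void) putbq(q, mp);    /* no tx resources; try later */
                        break;
                }
        }
        return (0);
}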
>
> Another option is for dld to free the message sent from the upper layer
> when dld is already queue-busy; if the upper stream strictly followed the
> STREAMS flow control requirement, it should not send messages down anyway.
> But I don't know whether that would cause a performance problem for TCP. I
> could test it.
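
(that option would amount to something like this at the top of
dld_tx_enqueue(); a hypothetical sketch, not proposed code:)

        mutex_enter(&dsp->ds_tx_list_lock);
        if (dsp->ds_tx_qbusy && !head_insert) {
                /*
                 * the upper stream ignored flow control; drop the new
                 * message(s) instead of queueing them.
                 */
                mutex_exit(&dsp->ds_tx_list_lock);
                freemsgchain(mp);
                return;
        }
        /* otherwise fall through, lock held, to the existing enqueue path */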
>
it's better not to free up messages if dld is qbusy because this might
cause tcp to rexmit and halve its send window, etc. if we have some buffering
in dld like we do today, by the time the buffer fills up, tx resources may be
available again in the nic. this avoids having to drop any messages.
you mentioned that the chain walking is taking a lot of time. how about if
you calculate the mp chain size only if dld_tx_enqueue() is called by
dld_tx_single() but not dld_wsrv()? the calculation won't take long
because the upper layers do not pass down chains. this will make ds_tx_cnt
not accurate but it's better than leaving the queue unbounded.
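something like the following, e.g. by having the callers pass the counts in;
just a sketch of the idea, and the extra tail/msgcnt/cnt parameters are
assumptions, not an actual diff:

static void
dld_tx_enqueue(dld_str_t *dsp, mblk_t *mp, mblk_t *tail, boolean_t head_insert,
    uint_t msgcnt, uint_t cnt)
{
        uint_t  tot_cnt, tot_msgcnt;

        mutex_enter(&dsp->ds_tx_list_lock);
        tot_cnt = dsp->ds_tx_cnt + cnt;
        tot_msgcnt = dsp->ds_tx_msgcnt + msgcnt;
        /* ... same threshold check and list manipulation as today ... */
}

void
dld_tx_single(dld_str_t *dsp, mblk_t *mp)
{
        ASSERT(mp->b_next == NULL);     /* upper layers hand down one message */

        /* ... if busy or the immediate transmit fails, enqueue it ... */
        dld_tx_enqueue(dsp, mp, mp, B_FALSE, 1, get_mpsize(mp));
}

dld_wsrv() would pass whatever counts it tracked while dequeueing (or rough
estimates), which is why ds_tx_cnt ends up only approximate as noted above.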
eric
> Thanks
> - Cathy
>
> >eric
> >
> >
> >>- Cathy
> >>
> >>>Hi,
> >>>
> >>>I had another look at the er_kernel data (which seems to have some problem
> >>>showing the disassembly code; I will ask Roch about that). Below is
> >>>what I found, and I hope I can get some of your suggestions:
> >>>
> >>>1. The first problem I noticed is the one I mentioned in another email:
> >>>a lot of CPU cycles are spent in dld_tx_enqueue()->qenable(), which in
> >>>turn drives the dld_wsrv()->dld_tx_enqueue() code path. That also seems
> >>>to make dsp->ds_tx_list_lock a hot lock, since it is held by both the
> >>>dld_tx_enqueue() and dld_wsrv() functions.
> >>>
> >>>But as Eric points out, we cannot remove the qenable() call from the
> >>>dld_tx_enqueue() function. I don't see anything we can do about the
> >>>above issue.
> >>>
> >>>2. The data also shows that the get_mpsize() function takes the second
> >>>most exclusive KCPU cycles. The get_mpsize() function was added by UV to
> >>>count the size of the messages in the chain, so that the dld_tx_enqueue()
> >>>function can determine whether or not to free the message(s) being
> >>>enqueued:
> >>>
> >>>static void
> >>>dld_tx_enqueue(dld_str_t *dsp, mblk_t *mp, boolean_t head_insert)
> >>>{
> >>>        ...
> >>>        /* Calculate total size and count of the packet(s) */
> >>>        for (tail = mp, cnt = get_mpsize(mp), msgcnt = 1;
> >>>            tail->b_next != NULL; tail = tail->b_next) {
> >>>                cnt += get_mpsize(tail->b_next);
> >>>                msgcnt++;
> >>>        }
> >>>
> >>>        mutex_enter(&dsp->ds_tx_list_lock);
> >>>        /*
> >>>         * If the queue depth would exceed the allowed threshold, drop
> >>>         * new packet(s) and drain those already in the queue.
> >>>         */
> >>>        tot_cnt = dsp->ds_tx_cnt + cnt;
> >>>        tot_msgcnt = dsp->ds_tx_msgcnt + msgcnt;
> >>>
> >>>        if (!head_insert &&
> >>>            (tot_cnt >= dld_max_q_count || tot_msgcnt >=
> >>>            dld_max_q_count)) {
> >>>                ASSERT(dsp->ds_tx_qbusy);
> >>>                mutex_exit(&dsp->ds_tx_list_lock);
> >>>                freemsgchain(mp);
> >>>                goto done;
> >>>        }
> >>>
> >>>        /* Update the queue size parameters */
> >>>        dsp->ds_tx_cnt = tot_cnt;
> >>>        dsp->ds_tx_msgcnt = tot_msgcnt;
> >>>        ...
> >>>}
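
(i assume get_mpsize() amounts to a msgdsize()-style walk over the b_cont
chain, roughly the sketch below; the actual UV helper may differ:)

static size_t
get_mpsize(mblk_t *mp)
{
        size_t  size = 0;

        for (; mp != NULL; mp = mp->b_cont)
                size += MBLKL(mp);      /* b_wptr - b_rptr of each block */

        return (size);
}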
> >>>
> >>>
> >>>My question is whether the above message-counting check is needed.
> >>>Usually, the stream layer above dld should call canputnext() before it
> >>>calls the dld tx function, and if dld is in a state where messages need
> >>>to be enqueued, canputnext() should return FALSE. This particular test
> >>>is the TCP TX test, and because TCP does not call canputnext() to check
> >>>whether dld can handle more tx packets, lots of CPU cycles are spent in
> >>>the dld_tx_enqueue() and get_mpsize() functions. But since TCP has its
> >>>own flow control, I don't think many messages will flood into the dld
> >>>queues and use up the memory.
> >>>
> >>>Thanks
> >>>- Cathy
> >>>
> >>>