Nicolas Droux wrote:
> 
> On May 14, 2009, at 10:51 AM, Andrew Gallatin wrote:
> 
>> Nitin Hande wrote:
>>> Andrew Gallatin wrote:
>>
>>>> When looking at this, I noticed mac_tx_serializer_mode().  Am I reading
>>>> this right, in that it serializes a single queue?  That seems lacking,
>>>> compared to the nxge_serialize stuff it replaces.
>>> Yes. This part was done for nxge and, as far as I remember, recent
>>> performance of this scheme was very close to that of the previous
>>> scheme. I think Gopi can comment more on this. What part do you
>>> think is missing here?
>>
>> Perhaps I'm missing something...  Doesn't nxge support multiple TX rings?
>> If so, does the existing serialization serialize all traffic to a
>> single ring, or is mac_tx_serializer_mode() applied after 
>> mac_tx_fanout_mode()?
>>
>> I had thought the original nxge serializer serialized each TX ring
>> separately in nxge.  The fork I made of it for myri10ge certainly
>> works that way.
> 
> The serializer is only for use by the nxge driver which has an 
> inefficient TX path locking implementation. We didn't have the resources 
> to completely rewrite the nxge transmit path as part of the Crossbow 
> project, so we moved the serialization implementation into MAC for that 
> driver. The serializer in MAC does serialization on a per-ring basis. 
> The serializer should not be used by any other driver.

krgopi said in an earlier reply, "mac_tx_serializer_mode() is used when
you have a single Tx ring. nxge would not use that mode."  So I'm
confused: from the source, it looks like nxge does use that mode
(MAC_VIRT_SERIALIZE |'ed into mi_v12n_level).

So I guess it is restricted to using only one of its hw tx rings, then?

> You don't have to use the serializer to support multiple TX rings. Keep 
> your TX path lean and mean, apply good design principles, e.g. avoid 
> holding locks for too long on your data-path, and you should be fine.

FWIW, my tx path is very "lean and mean".  The only times locks are
held are when writing the tx descriptors to the NIC, and when
allocating a dma handle from a pre-allocated per-ring pool.
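
Roughly, the shape is this (a simplified sketch of what I mean, not
the actual myri10ge source; tx_ring_t, tx_handle_t, write_descriptors()
and ring_doorbell() are made-up names):

#include <sys/stream.h>
#include <sys/strsun.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

static mblk_t *
ring_tx(tx_ring_t *ring, mblk_t *mp)
{
	tx_handle_t *h;
	ddi_dma_cookie_t cookie;
	uint_t ncookies;

	/* Grab a pre-allocated dma handle; the free list has its own lock. */
	mutex_enter(&ring->pool_lock);
	if ((h = ring->free_handles) == NULL) {
		mutex_exit(&ring->pool_lock);
		return (mp);		/* out of handles, hand the chain back */
	}
	ring->free_handles = h->next;
	mutex_exit(&ring->pool_lock);

	/* The dma binding is done without holding the ring lock. */
	(void) ddi_dma_addr_bind_handle(h->dma, NULL, (caddr_t)mp->b_rptr,
	    MBLKL(mp), DDI_DMA_WRITE | DDI_DMA_STREAMING, DDI_DMA_DONTWAIT,
	    NULL, &cookie, &ncookies);

	/* The ring lock is held only while the descriptors are written. */
	mutex_enter(&ring->tx_lock);
	write_descriptors(ring, &cookie, ncookies, h);
	ring_doorbell(ring);
	mutex_exit(&ring->tx_lock);

	return (NULL);			/* chain consumed */
}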

I thought the serializer was silly too, but PAE claimed a speedup
from it.  As I recall, the claimed speedup came from never
back-pressuring the stack when the host overran the NIC.  One of the
"features" of the serializer was to always block the calling thread
if the tx queue was exhausted.
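
To make the contrast concrete, here is the difference in behavior as I
understand it (pseudocode only; ring_t, ring_full(), send() and
space_cv are invented names, not MAC or nxge interfaces):

/* Normal mode: report back-pressure and let MAC queue / flow-control. */
static mblk_t *
tx_nonblocking(ring_t *ring, mblk_t *mp)
{
	if (ring_full(ring))
		return (mp);	/* unsent chain handed back to the stack */
	send(ring, mp);
	return (NULL);
}

/* Serializer-style: never push back; park the sending thread instead. */
static mblk_t *
tx_blocking(ring_t *ring, mblk_t *mp)
{
	mutex_enter(&ring->lock);
	while (ring_full(ring))
		cv_wait(&ring->space_cv, &ring->lock);	/* caller blocks here */
	send(ring, mp);
	mutex_exit(&ring->lock);
	return (NULL);		/* chain always consumed */
}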

Have you done any packets-per-second benchmarks with your
fanout code?  I'm concerned that it's very cache unfriendly
if you have a long run of packets all going to the same
destination.  This is because you walk the mblk chain, reading
the packet headers, and queue up a big chain.  If the chain
gets too long, the mblk and/or the packet headers will be
pushed out of cache by the time they make it to the driver's
xmit routine.  So in this case you could have twice as many
cache misses as normal when things get really backed up.
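
To spell out the pattern I mean (an illustrative sketch, not the
Crossbow fanout code; header_hash() is a made-up name):

static void
fanout(mblk_t *chain, mblk_t **heads, mblk_t **tails, int nrings)
{
	mblk_t *mp, *next;
	int r;

	for (mp = chain; mp != NULL; mp = next) {
		next = mp->b_next;
		mp->b_next = NULL;

		/* Touch #1: read the IP/TCP headers to pick a ring. */
		r = header_hash(mp) % nrings;

		if (heads[r] == NULL)
			heads[r] = mp;
		else
			tails[r]->b_next = mp;
		tails[r] = mp;
	}

	/*
	 * Each per-ring chain is later handed to the driver, whose xmit
	 * routine reads the same mblks and headers again (touch #2).  If
	 * the chain grew long, those lines may have been evicted by then.
	 */
}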

Lastly, you (or somebody) mentioned there was interest in adding
a hook for a driver to do fanout.  Is there a bugid or something
for this?

Drew
