rajagopal kunhappan wrote:
> Andrew Gallatin wrote:
>>>>
>>>> FWIW, my tx path is very "lean and mean".  The only time
>>>> locks are held are when writing the tx descriptors to the NIC,
>>>> and when allocating a dma handle from a pre-allocated per-ring pool.
>>>>
>>>> I thought the serializer was silly too, but PAE claimed a speedup
>>>> from it.  I think that  PAE claimed the speedup came from
>>>> never back-pressuring the stack when the host overran the
>>>> NIC. One of the "features" of the serializer was to always
>>>> block the calling thread if the tx queue was exhausted.
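
(To spell out what "block the calling thread" versus backpressure looks
like, here is a rough sketch.  The ring layout and names are invented
for illustration; this is not the actual serializer or nxge code.)

    /* Sketch only -- invented names, not the real serializer/nxge code. */
    static mblk_t *
    ring_tx(tx_ring_t *ring, mblk_t *mp, boolean_t blocking)
    {
            mutex_enter(&ring->r_lock);
            while (ring->r_descs_free == 0) {
                    if (!blocking) {
                            /* backpressure: hand the chain back and let
                             * tx reclaim call mac_tx_update() later */
                            ring->r_stalled = B_TRUE;
                            mutex_exit(&ring->r_lock);
                            return (mp);            /* not consumed */
                    }
                    /* serializer-style: park the sender until reclaim
                     * signals that descriptors are free again */
                    cv_wait(&ring->r_space_cv, &ring->r_lock);
            }
            /* ... fill descriptors, ring the doorbell ... */
            mutex_exit(&ring->r_lock);
            return (NULL);                          /* consumed */
    }
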
>>>>
>>>> Have you done any packets-per-second benchmarks with your
>>>> fanout code?  I'm concerned that it's very cache unfriendly
>>>
>>> With nxge we get line rate with MTU sized packets with 8 Tx rings.
>>> The numbers are similar to what they were with the nxge serializer in place.
>>>
>>>> if you have a long run of packets all going to the same
>>>> destination. This is because you walk the mblk chain, reading
>>>> the packet headers, and queue up a big chain.  If the chain
>>>> gets too long, the mblk and/or the packet headers will be
>>>> pushed out of cache by the time they make it to the driver's
>>>> xmit routine.  So in this case you could have twice as many
>>>> cache misses as normal when things get really backed up.
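
(To make the pattern I am worried about concrete -- grossly simplified,
and the helpers here are invented names, not the actual crossbow or nxge
code:)

    /* pass 1: fanout walks the whole chain and reads every header */
    for (mp = chain; mp != NULL; mp = mp->b_next)
            enqueue(rings[fanout_hash(mp->b_rptr)], mp);   /* touch mblk + header */

    /* the per-ring chains then sit on a queue; if they grow long
     * enough, those same cache lines are evicted before pass 2 */

    /* pass 2: the driver walks its chain again and re-reads the
     * headers to build tx descriptors */
    for (mp = ring->head; mp != NULL; mp = mp->b_next)
            fill_tx_desc(ring, mp);                        /* touch mblk + header again */
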
>>>
>>> We would like to have the drivers operate in non-serialized mode. But
>>> if, for whatever reason, you want to use serialized mode and there
>>> are issues, we can look into that.
>>
>> The only reason I care about the serializer is the pre-crossbow
>> feedback from PAE that the original serializer avoided
>> putting backpressure on the stack when the TX rings fill up.
> 
> Yes, pre-crossbow, putting back pressure would mean queueing up packets
> in DLD. All packets get queued in DLD until the driver relieves the
> flow control; by then thousands of packets can be sitting in DLD (this
> is because TCP does not check for the STREAMS QFULL condition on the
> DLD write queue and keeps on sending packets). After flow control is
> relieved, the queued-up packets are drained by dld_wsrv() in single
> threaded mode. A single thread is no good on a 10gig link, and that
> caused performance issues.

And crossbow addresses this?
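
(For anyone following along: that single-threaded drain is the classic
STREAMS write service routine pattern, roughly the sketch below.  This
is not the actual dld_wsrv(); try_tx() is a stand-in for whatever
resubmits the packet downstream.)

    static int
    foo_wsrv(queue_t *q)
    {
            mblk_t *mp;

            /* one service thread works through everything that piled up */
            while ((mp = getq(q)) != NULL) {
                    if (try_tx(mp) != 0) {
                            /* still flow controlled: put it back at the
                             * head and wait to be re-enabled */
                            (void) putbq(q, mp);
                            break;
                    }
            }
            return (0);
    }

One thread grinding through thousands of queued mblks is exactly the
serialization being described.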

>> I'm happy using the normal fanout (with some caveats below) as
>> long as PAE doesn't complain about it later.
>>
>> The caveat being that I want a fanout mode that uses a
>> standard Toeplitz hash so as to maintain CPU locality.
> 
> I am curious as to how you maintain CPU locality for Tx traffic. Can you 
> give some details?
> 
> On the Solaris stack, if you have a bunch of, say, TCP connections sending
> traffic, they can come from any CPU on the system. By this I mean that the
> CPU an application runs on is completely random unless you do CPU binding.
> 
> I can see tying Rx traffic to a specific Rx ring and CPU. If it is a 
> forwarding case, then one can tie an Rx ring to a Tx ring.


On OSes other than Windows, this helps mainly on the TCP receive side,
in that acks will flow out on the same CPU that handled the receive
(assuming a direct dispatch from the ISR through to the TCP/IP stack).

AFAIK, only Windows can really control affinity at a fine level, since
they require a Toeplitz hash, and you must provide hooks to use their
"key", and to update your indirection table.  This means they can
control affinities for connections (or at least sets of connections)
and update them on the fly to match the application's affinity.
According to our Windows guy, they really use this stuff.
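
(For reference, the hash they mandate is cheap to describe.  A rough
userland-style sketch, not any particular driver's code; the key must be
at least len + 4 bytes, which the 40-byte RSS key covers for the usual
4-tuple inputs.)

    #include <stdint.h>
    #include <stddef.h>

    uint32_t
    toeplitz_hash(const uint8_t *key, const uint8_t *in, size_t len)
    {
            /* sliding 32-bit window over the key, big-endian */
            uint32_t k = ((uint32_t)key[0] << 24) | (key[1] << 16) |
                (key[2] << 8) | key[3];
            uint32_t hash = 0;
            size_t i, b;

            for (i = 0; i < len; i++) {
                    for (b = 0; b < 8; b++) {
                            if (in[i] & (0x80u >> b))
                                    hash ^= k;
                            /* slide the key window left one bit */
                            k = (k << 1) | ((key[i + 4] >> (7 - b)) & 1);
                    }
            }
            return (hash);
    }

The OS hands the NIC the key and an indirection table it can rewrite on
the fly, and the NIC just does something like
indir_table[hash & (table_size - 1)] to pick the queue/CPU.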

But all of this depends on the OS and the NIC agreeing on the hash.
Is there any reason (patents?  complexity?  perception that the
Windows solution is inferior?) that crossbow does not try to take
the Windows approach?  Essentially all NICs available today that support
multiple RX queues also support all this other stuff that
Windows requires.  Why not take advantage of it?

Drew
