Sunay Tripathi wrote:
> Garrett,
>
> Garrett D'Amore wrote:
>> I've been thinking about hardware that has multiple transmit rings 
>> ("tx resources").
>>
>> We really should have a way to expose this up to the stack.  And 
>> ideally, the stack should guarantee that a given flow will always be 
>> sent down using the same hardware tx resource.
>>
>> I've heard that crossbow will deliver this, but I can't find evidence 
>> of it in the crossbow gate.  Am I missing something?  Is it 
>> functionality yet to be added, or is it not planned?
>
> It's designed in, but the code has yet to make it into the Crossbow
> gate. I think parts of it are sitting in Roamer's and Gopi's
> workspaces.

Okay.  Are there any design documents which provide the overall view 
of this?  I've read bits and pieces of Crossbow, and the marketing 
literature, but I'd really like to have details all the way down to 
the driver API level.

>
>> The other problem I've heard from PAE is that one potential 
>> approach drivers could use today -- mapping the flow by hashing 
>> the sending CPU (which one would expect not to change for a given 
>> flow) -- is doomed to suffer packet reordering.  Apparently the 
>> problem is that application threads can get bounced around between 
>> CPUs by the scheduler pretty freely (more so than one would 
>> think), and the result is that you can't assume that the sending 
>> CPU will be reasonably static for a given flow.  (I gotta think 
>> this wreaks havoc on the caches involved... but that's a different 
>> problem.)
>>
>> _If_ transmitted packets are sent to the stack and always land in 
>> a delivery queue, then perhaps the outbound queue (squeue?) can 
>> have a worker thread that doesn't migrate around.  But in order 
>> for that to happen, we have to stop having sending threads deliver 
>> right to the driver when intervening queues are empty.
>
> This doesn't really apply to forwarding traffic 

Agreed.  Although if we use multiple rings for forwarding, we still have 
to be careful to minimize reordering of the forwarded streams.
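
Something like a flow-tuple hash, rather than a sending-CPU hash, is 
what I have in mind: a given flow then always lands on the same ring 
no matter which CPU the scheduler puts the sender on.  A rough sketch 
of the idea, with made-up names rather than anything in the current 
MAC/GLDv3 interfaces:

#include <sys/types.h>

/*
 * Illustrative only: pick a tx ring from the flow's addressing
 * tuple, which is invariant for the life of the flow, instead of
 * from the sending CPU, which migrates.
 */
static uint_t
flow_to_ring(uint32_t saddr, uint32_t daddr,
    uint16_t sport, uint16_t dport, uint_t nrings)
{
	uint32_t hash;

	hash = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);
	hash ^= hash >> 16;	/* fold the high bits into the low bits */

	return (hash % nrings);
}

Because the ring index depends only on the tuple, packets of one 
stream never get spread across rings, and it doesn't matter how often 
the sending thread migrates.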

> and in the case of traffic terminating on the host, the application
> thread is very rarely able to reach the driver directly (it's about
> 17-18% of the time on web workloads). When it does, it means there
> was nothing else to do anyway, and it's better to let the thread go
> through instead of doing a context switch.

I think this is a fallacy, even if you have observed it.

That's because it ignores another potential location of queuing: the 
device driver (and the hardware) itself.  For example, some hardware 
rings have fairly deep TX queues -- up to 1,000 packets or more in 
some cases -- which can lead to incorrect assumptions about just how 
busy the link really is.  And if you have multiple such rings, it's 
really, really important to get the ordering right.
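
Just to make the point concrete: even when every queue the stack can 
see is empty, the descriptor ring itself may be holding a great deal 
of work.  A toy illustration (hypothetical structure, not any 
particular driver):

#include <sys/types.h>

#define	TX_RING_SIZE	1024	/* descriptors, i.e. on the order of
				   1,000 packets still in flight */

typedef struct tx_ring {
	uint_t	tr_head;	/* next descriptor the hardware will fetch */
	uint_t	tr_tail;	/* next descriptor the driver will fill */
} tx_ring_t;

/* packets posted to the hardware but not yet transmitted */
static uint_t
tx_ring_pending(tx_ring_t *tr)
{
	return ((tr->tr_tail - tr->tr_head) & (TX_RING_SIZE - 1));
}

An "empty" squeue tells you nothing about tx_ring_pending(), so 
letting the sending thread decide it can go straight to the driver is 
a decision made with incomplete information.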

I also fear that the attempt to "let the packet pass thru" is an 
optimization for the case of a lightly loaded environment, without 
regard to the impact it places upon the driver.

Essentially, what I'm saying is that I'm concerned a design which 
requires the NIC driver to consider load balancing and flow 
management is inherently busted.  It's much, much better, I think, if 
the ordering and ring scheduling considerations are handled by the 
stack, without any brains whatsoever on the part of the driver.  
Anything else leads to either a lot of wasted driver cycles or 
drivers that make poor decisions because they don't have sufficient 
information.  I think we can see a bit of both in at least two of the 
drivers that support multiple tx rings: nxge and ce.
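
To put it concretely, the split I'd like to see looks roughly like 
this (hypothetical types and names, not the existing nxge/ce or MAC 
entry points): the driver registers per-ring send routines and 
nothing else, and the stack owns the flow-to-ring decision.

#include <sys/types.h>
#include <sys/stream.h>

/*
 * Hypothetical shape of the split: the driver exposes dumb per-ring
 * send routines; all flow and ring policy lives in the stack.
 */
typedef struct tx_ring_info {
	void	*tri_handle;			/* opaque driver ring handle */
	mblk_t	*(*tri_send)(void *, mblk_t *);	/* per-ring send routine */
} tx_ring_info_t;

typedef struct flow {
	uint_t	f_ring;		/* chosen once at flow creation by hashing
				   the flow tuple, never per packet */
} flow_t;

static mblk_t *
stack_tx(tx_ring_info_t *rings, flow_t *flow, mblk_t *mp)
{
	tx_ring_info_t *r = &rings[flow->f_ring];

	/* the driver sees no flow state; it just queues on its ring */
	return (r->tri_send(r->tri_handle, mp));
}

The driver never has to guess about load balancing, and the ordering 
guarantee falls out of the fact that f_ring never changes for a live 
flow.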

This also leads, I think, to some of the craziness PAE has to go 
through to manually tune the device drivers.  We really should be 
looking at ways to remove driver tuning from the steps customers have 
to take to get good performance.

>
>> I _think_ this will work better for throughput.  It may hurt latency 
>> slightly though.  I haven't measured the latencies involved with 
>> queuing as opposed to direct delivery through the driver's 
>> xxx_send/xxx_start routine, but I'd be curious to know if others here 
>> have.
>
> Yes, you are discussing the FireEngine design here. The ARC case has
> a detailed document which discusses all these things. I can't
> remember the case number, but search for FireEngine.

Thanks, I'll investigate further.

    -- Garrett

>
> Cheers,
> Sunay
>
>>
>> Anyway, let me know your thoughts.
>>
>>    -- Garrett
>>
>> _______________________________________________
>> crossbow-discuss mailing list
>> crossbow-discuss at opensolaris.org
>> http://opensolaris.org/mailman/listinfo/crossbow-discuss
>
>

