On Thu, Jul 28, 2011 at 12:11 PM, Jesse Gross <[email protected]> wrote:
> On Thu, Jul 28, 2011 at 10:16 AM, Pravin Shelar <[email protected]> wrote:
>> On Wed, Jul 27, 2011 at 4:13 PM, Jesse Gross <[email protected]> wrote:
>>> On Wed, Jul 27, 2011 at 2:24 PM, Pravin Shelar <[email protected]> wrote:
>>>> On Wed, Jul 27, 2011 at 12:17 PM, Jesse Gross <[email protected]> wrote:
>>>>> On Wed, Jul 27, 2011 at 11:14 AM, Ethan Jackson <[email protected]> wrote:
>>>>>>> One strategy that I have considered is to be able to ask only for flows
>>>>>>> that have a non-zero packet count.  That would help with the common case
>>>>>>> where, when there is a large number of flows, they are caused by a port
>>>>>>> scan or some other activity with 1-packet flows.  It wouldn't help at
>>>>>>> all in your case.
>>>>>>
>>>>>> You could also have the kernel pass down to userspace what logically
>>>>>> amounts to a list of the flows  which have had their statistics change
>>>>>> in the past 10 seconds.  A bloom filter would be a sensible approach.
>>>>>> Again, probably won't help at all in Simon's case, and may or may not
>>>>>> be a useful optimization above simply not pushing down statistics for
>>>>>> flows which have a zero packet count.
>>>>>
>>>>> I don't think that you could implement a Bloom filter like this in a
>>>>> manner that wouldn't cause cache contention.  Probably you would still
>>>>> need to iterate over every flow in the kernel, you would just be
>>>>> comparing last used time to current time - 10 instead of packet count
>>>>> not equal to zero.
>>>>>
>>>> CPU cache contention can be fixed by partitioning all flows by
>>>> something (e.g. port number) and assigning cache replacement
>>>> processing for each partition to a CPU.  The replacement algorithm
>>>> could be as simple as active and inactive LRU lists; this is
>>>> roughly how kernel page cache replacement works at a high level.
>>>
>>> This isn't really a cache replacement problem though.  Maybe that's
>>> the high level goal that's being solved but I wouldn't want to make
>>> that assumption in the kernel as it would likely impose too many
>>> restrictions on what userspace can do if it wants to implement
>>> something completely different in the future.  Anything the kernel
>>> provides should just be a simple primitive, potentially analogous to
>>> the referenced bit that you would find in a page table.
>>>
>>> You also can't impose a CPU partitioning scheme on flows because we
>>> don't control the CPU that packets are being processed on.  That's
>>> determined by the originator of the packet (such as RSS on the NIC)
>>> and then we just handle it on the same CPU.  However, you can use a
>>> per-CPU data structure to store information regardless of flow and
>>> then merge them later.  This actually works well enough for something
>>> like a Bloom filter because you can superimpose the results on top of
>>> each other without a problem.
>>
>> I am not sure why packet CPU can not be controlled by using interrupt
>> affinity/RSS.
>>
>> I think partitioning on basis of cpu or port number is good for scalability.
>
> If you're using a NIC with RSS you'll get partitioning on a per-flow
> basis, which will provide better load balancing than doing it on the
> basis of port.
>
> If you're receiving packets from a virtual interface from a VM, the
> CPU depends on what the interface wants to do.  It could be pretty
> much anything, though for example Xen does it on a per-port basis.
>
> If sent from an application on the local port, the CPU will be
> whichever one is executing the application.
>
> So I'm not saying that it can't be controlled or isn't already
> partitioned, just that you can't rely on any particular scheme within
> OVS unless you want to use an IPI, which certainly isn't going to help
> performance.
>

After discussing this, we think the best way to partition the hash
table is on the basis of the CPU that handles the very first packet of
a flow.  Most likely, all packets for a given flow will arrive on the
same CPU, irrespective of the origin of the packets.

There is a possibility of some duplication of flows across partitions,
but that should be minimal in practice.
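For comparison, the per-CPU Bloom filter approach discussed earlier in
the thread avoids contention the same way (each CPU writes only its own
structure), and the point that the results can be superimposed on top of
each other amounts to OR-ing the bit arrays together.  A minimal
self-contained sketch, with invented sizes and hash functions rather
than anything from OVS:

```c
/*
 * Minimal sketch of per-CPU Bloom filters: each CPU sets bits in its
 * own filter without contention, and the filters are later merged with
 * a bitwise OR.  Sizes and hash functions are invented for
 * illustration.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FILTER_BITS  1024
#define FILTER_WORDS (FILTER_BITS / 64)
#define N_CPUS       4

typedef struct { uint64_t words[FILTER_WORDS]; } bloom_t;

static bloom_t per_cpu_filters[N_CPUS];

/* Two cheap hash functions; a real filter would use more/better ones. */
static uint32_t hash1(uint32_t key) { return (key * 2654435761u) % FILTER_BITS; }
static uint32_t hash2(uint32_t key) { return (key * 40503u + 1) % FILTER_BITS; }

static void bloom_set(bloom_t *f, uint32_t bit)
{
    f->words[bit / 64] |= UINT64_C(1) << (bit % 64);
}

static bool bloom_test(const bloom_t *f, uint32_t bit)
{
    return (f->words[bit / 64] >> (bit % 64)) & 1;
}

/* Mark a flow as recently active in the current CPU's filter. */
static void bloom_add(int cpu, uint32_t key)
{
    bloom_set(&per_cpu_filters[cpu], hash1(key));
    bloom_set(&per_cpu_filters[cpu], hash2(key));
}

/* Superimpose all per-CPU filters into 'out' with a bitwise OR. */
static void bloom_merge(bloom_t *out)
{
    for (int w = 0; w < FILTER_WORDS; w++) {
        out->words[w] = 0;
        for (int cpu = 0; cpu < N_CPUS; cpu++) {
            out->words[w] |= per_cpu_filters[cpu].words[w];
        }
    }
}

/* "Maybe active" query: false positives possible, false negatives not. */
static bool bloom_query(const bloom_t *f, uint32_t key)
{
    return bloom_test(f, hash1(key)) && bloom_test(f, hash2(key));
}
```

Userspace could then skip requesting statistics for any flow the merged
filter reports as definitely inactive, at the cost of occasionally
fetching a flow that only looks active due to a false positive.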

Thanks,
Pravin.
_______________________________________________
dev mailing list
[email protected]
http://openvswitch.org/mailman/listinfo/dev