On Thu, Jul 28, 2011 at 12:11 PM, Jesse Gross <[email protected]> wrote:
> On Thu, Jul 28, 2011 at 10:16 AM, Pravin Shelar <[email protected]> wrote:
>> On Wed, Jul 27, 2011 at 4:13 PM, Jesse Gross <[email protected]> wrote:
>>> On Wed, Jul 27, 2011 at 2:24 PM, Pravin Shelar <[email protected]> wrote:
>>>> On Wed, Jul 27, 2011 at 12:17 PM, Jesse Gross <[email protected]> wrote:
>>>>> On Wed, Jul 27, 2011 at 11:14 AM, Ethan Jackson <[email protected]> wrote:
>>>>>>> One strategy that I have considered is to be able to ask only for
>>>>>>> flows that have a non-zero packet count.  That would help with the
>>>>>>> common case where, when there is a large number of flows, they are
>>>>>>> caused by a port scan or some other activity with 1-packet flows.
>>>>>>> It wouldn't help at all in your case.
>>>>>>
>>>>>> You could also have the kernel pass down to userspace what logically
>>>>>> amounts to a list of the flows which have had their statistics change
>>>>>> in the past 10 seconds.  A Bloom filter would be a sensible approach.
>>>>>> Again, it probably won't help at all in Simon's case, and may or may
>>>>>> not be a useful optimization over simply not pushing down statistics
>>>>>> for flows which have a zero packet count.
>>>>>
>>>>> I don't think that you could implement a Bloom filter like this in a
>>>>> manner that wouldn't cause cache contention.  Probably you would still
>>>>> need to iterate over every flow in the kernel; you would just be
>>>>> comparing last used time to current time - 10 instead of packet count
>>>>> not equal to zero.
>>>>>
>>>> CPU cache contention can be fixed by partitioning all flows by
>>>> something (e.g. port number) and assigning cache replacement
>>>> processing to a CPU.  The replacement algorithm could be as simple as
>>>> active and inactive LRU lists; this is how kernel page cache
>>>> replacement looks from a high level.
>>>
>>> This isn't really a cache replacement problem, though.  Maybe that's
>>> the high-level goal being solved, but I wouldn't want to make that
>>> assumption in the kernel, as it would likely impose too many
>>> restrictions on what userspace can do if it wants to implement
>>> something completely different in the future.  Anything the kernel
>>> provides should just be a simple primitive, potentially analogous to
>>> the referenced bit that you would find in a page table.
>>>
>>> You also can't impose a CPU partitioning scheme on flows because we
>>> don't control the CPU that packets are being processed on.  That's
>>> determined by the originator of the packet (such as RSS on the NIC),
>>> and then we just handle it on the same CPU.  However, you can use a
>>> per-CPU data structure to store information regardless of flow and
>>> then merge the results later.  This actually works well enough for
>>> something like a Bloom filter because you can superimpose the results
>>> on top of each other without a problem.
>>
>> I am not sure why the packet CPU cannot be controlled by using
>> interrupt affinity/RSS.
>>
>> I think partitioning on the basis of CPU or port number is good for
>> scalability.
>
> If you're using a NIC with RSS you'll get partitioning on a per-flow
> basis, which will provide better load balancing than doing it on the
> basis of port.
>
> If you're receiving packets from a virtual interface of a VM, the CPU
> depends on what the interface wants to do.  It could be pretty much
> anything, though for example Xen does it on a per-port basis.
>
> If sent from an application on the local port, the CPU will be
> whichever one is executing the application.
>
> So I'm not saying that it can't be controlled or isn't already
> partitioned, just that you can't rely on any particular scheme within
> OVS unless you want to use an IPI, which certainly isn't going to help
> performance.
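(For concreteness, the "superimpose the results" step described above is
just a bitwise OR of per-CPU bit arrays.  A minimal userspace-style
sketch follows; the names, sizes, and two-probe hashing are invented for
illustration and are not OVS code:

    #include <stdint.h>
    #include <string.h>

    #define NR_CPUS      64
    #define BLOOM_BITS   (1u << 20)          /* 2^20 bits = 128 KB per CPU */
    #define BLOOM_WORDS  (BLOOM_BITS / 64)

    static uint64_t per_cpu_filter[NR_CPUS][BLOOM_WORDS];

    static void set_bit64(uint64_t *bits, uint32_t n)
    {
            bits[n / 64] |= 1ULL << (n % 64);
    }

    /* Called on whichever CPU updates a flow's stats: mark the flow as
     * recently used in this CPU's private filter.  Two derived probes
     * stand in for the k hash functions of a real Bloom filter. */
    static void bloom_mark(int cpu, uint32_t flow_hash)
    {
            set_bit64(per_cpu_filter[cpu], flow_hash % BLOOM_BITS);
            set_bit64(per_cpu_filter[cpu],
                      (flow_hash * 0x9e3779b9u) % BLOOM_BITS);
    }

    /* Reader side: since a Bloom filter is just a bit array, OR-ing the
     * per-CPU filters together yields exactly the filter that would have
     * resulted from every CPU setting bits in one shared filter. */
    static void bloom_merge(uint64_t merged[BLOOM_WORDS])
    {
            memset(merged, 0, BLOOM_WORDS * sizeof(uint64_t));
            for (int cpu = 0; cpu < NR_CPUS; cpu++)
                    for (size_t w = 0; w < BLOOM_WORDS; w++)
                            merged[w] |= per_cpu_filter[cpu][w];
    }

Each CPU writes only to its own filter on the hot path, so nothing
bounces between caches; the reader pays the OR-merge cost once per flow
dump.)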
After discussing this, we think the best way to partition the hash table
is on the basis of the CPU that handles the very first packet of a flow.
Most likely all packets for a given flow will arrive on the same CPU,
irrespective of the origin of the packets.  There is a possibility of
some duplication of flows, but that should be minimal in practice.

Thanks,
Pravin.
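P.S. For anyone curious what first-packet-CPU partitioning could look
like, here is a rough self-contained sketch.  All of the names (sw_flow,
flow_insert, ...) and the fixed table size are invented for
illustration; this is not the actual datapath code:

    #include <stdint.h>
    #include <stddef.h>

    #define NR_CPUS   64
    #define TBL_SIZE  1024

    struct sw_flow {
            uint32_t hash;            /* hash of the extracted flow key */
            /* key, actions, stats ... elided for brevity */
            struct sw_flow *next;     /* hash-bucket chain */
    };

    struct flow_table {
            struct sw_flow *buckets[TBL_SIZE];
    };

    static struct flow_table per_cpu_tbl[NR_CPUS];

    /* Flow setup: the new flow lands in the table of whichever CPU
     * handled its first packet, so insertion needs no global lock. */
    static void flow_insert(int cpu, struct sw_flow *flow)
    {
            struct sw_flow **head =
                    &per_cpu_tbl[cpu].buckets[flow->hash % TBL_SIZE];

            flow->next = *head;
            *head = flow;
    }

    /* Receive path: because RSS keeps a flow's packets on one CPU, the
     * local table almost always hits.  A miss would fall back to
     * scanning the other CPUs' tables (or to userspace, which may then
     * install a duplicate entry in the local table: the "some
     * duplication of flows" mentioned above). */
    static struct sw_flow *flow_lookup(int cpu, uint32_t hash)
    {
            struct sw_flow *f;

            for (f = per_cpu_tbl[cpu].buckets[hash % TBL_SIZE];
                 f; f = f->next)
                    if (f->hash == hash)  /* real code compares full keys */
                            return f;
            return NULL;
    }

The common case touches only the local CPU's table; the rare cross-CPU
miss is also where the duplicate entries mentioned above would come
from.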
