On 5/27/22 08:46, Michael Santana wrote:
>
>
> On 5/27/22 02:37, Michael Santana wrote:
>> The handler and CPU mapping in upcalls are incorrect, and this is
>> especially noticeable on systems with CPU isolation enabled.
>>
>> Say we have a 12-core system where only every even-numbered CPU is
>> enabled: C0, C2, C4, C6, C8, C10.
>>
>> This means we will create an array of size 6 that will be sent to the
>> kernel and that is populated with sockets [S0, S1, S2, S3, S4, S5].
>>
>> The problem is that when the kernel does an upcall it indexes the
>> socket array by CPU id, effectively adding additional load on some
>> sockets while leaving no work for others.
>>
>> e.g.
>>
>> C0 indexes to S0
>> C2 indexes to S2 (should be S1)
>> C4 indexes to S4 (should be S2)
>>
>> Modulo of 6 (size of socket array) is applied, so we wrap back to S0:
>> C6 indexes to S0 (should be S3)
>> C8 indexes to S2 (should be S4)
>> C10 indexes to S4 (should be S5)
>>
>> Effectively, sockets S0, S2, S4 get overloaded while sockets S1, S3,
>> S5 get no work assigned to them.
>>
>> This leads the kernel to throw the following message:
>> "openvswitch: cpu_id mismatch with handler threads"
>>
>> Instead, we will send the kernel a corrected array of sockets the
>> size of all CPUs in the system.  In the above example we would create
>> the corrected array in a round-robin fashion as follows:
>> [S0, S1, S2, S3, S4, S5, S0, S1, S2, S3, S4, S5]
>>
>> Fixes: b1e517bd2f81 ("dpif-netlink: Introduce per-cpu upcall dispatch.")
>>
>> Co-authored-by: Aaron Conole <acon...@redhat.com>
>> Signed-off-by: Aaron Conole <acon...@redhat.com>
>> Signed-off-by: Michael Santana <msant...@redhat.com>
>
>
> Hi Ilya,
>
> I sent this patch to get the conversation moving again.  It implements
> the round-robin schema you had proposed.
>
> However, there is a problem with the round-robin schema.  But don't
> worry, I think I have come up with a schema that will make everyone
> happy.
>
> The problem with the round-robin schema (+ magic prime number) is that
> it is active-core agnostic.  My understanding from conversations I
> have had with Flavio and Aaron is that active cores have a high
> probability of seeing traffic (by configuring the NIC to send traffic
> specifically to those active cores), as this would be the "correctly"
> configured methodology.
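To make the corrected mapping concrete, here is a minimal standalone
sketch of the 12-core example from the quoted commit message.  It is
purely illustrative; the array and macro names are made up and this is
not the actual dpif-netlink code:

/* Illustrative sketch only; not the actual dpif-netlink code. */
#include <stdio.h>

#define N_CPUS     12   /* All possible CPUs, C0..C11.                 */
#define N_HANDLERS  6   /* Handler sockets S0..S5, one per online CPU. */

int
main(void)
{
    int upcall_socket[N_CPUS];

    /* Corrected array from the commit message, built round-robin:
     * [S0, S1, S2, S3, S4, S5, S0, S1, S2, S3, S4, S5]. */
    for (int cpu = 0; cpu < N_CPUS; cpu++) {
        upcall_socket[cpu] = cpu % N_HANDLERS;
    }

    /* The kernel indexes this array directly with the cpu_id that
     * triggered the upcall; every possible cpu_id now has a valid
     * entry, so no modulo wrap-around (and no "cpu_id mismatch with
     * handler threads" warning) can occur. */
    for (int cpu = 0; cpu < N_CPUS; cpu++) {
        printf("cpu_id %2d -> S%d\n", cpu, upcall_socket[cpu]);
    }
    return 0;
}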
I don't think this assumption (that the active cores are where the
traffic will arrive) is correct.  Cores are not always strictly
isolated in the kernel.  In most common cases they will just be sliced
with cgroups or isolated with tuned, so interrupts can still be
processed on cores unavailable to OVS userspace processes.  And I will
also say that it's likely a much more performant configuration when the
kernel processing happens on a different core from where the userspace
upcall handling happens.  We may also have irqbalance moving IRQ
affinity dynamically between cores.  There is no such thing as an
"inactive core" in a common system from the kernel's point of view,
unless you shut it off entirely.

Looking recently at some systems with a high upcall rate, the CPU usage
distribution between OVS and the kernel appears to be 60/40.  If the
kernel processing is moved to a different core, the maximum upcall rate
can probably be doubled.  I didn't test that, but that is what I'm
suspecting.

> We should ideally give priority to these active cores.  So we should
> avoid sharing the same handler thread among cores, especially active
> cores, as you had also mentioned in this thread.
>
> With the round-robin schema you can easily deliberately create a worst
> case scenario where 4 active cores are serviced by a single handler
> thread (s0).

As I said before, you can easily deliberately create a worst case
scenario for any distribution scheme.  We can't avoid such cases.  The
only way to avoid that is to create a thread per physical core.

> Say you have a 16 core system, with only cores c0, c5, c10, c15
> active.  Using the prime number schema we would have 5 handler
> threads.  The mapping would look like this:
>
> ##                  ##                  ##                  ##
> c0  c1  c2  c3  c4  c5  c6  c7  c8  c9  c10 c11 c12 c13 c14 c15
> s0  s1  s2  s3  s4  s0  s1  s2  s3  s4  s0  s1  s2  s3  s4  s0
>
> Instead we can combine the design of the original patch and the
> round-robin design.  Doing so we avoid sharing handler threads among
> active cores.
>
> First we need more threads than active cores, like in the round-robin
> (+ magic prime number) implementation.  I would like to change the
> prime number to something simpler instead.  I would like the total
> number of handler threads to be (this obviously only applies when we
> know we have inactive cores, i.e. active_cores < total_cores):
>
>   handlers_n = active_cores + min(active_cores/4 + 1, inactive_cores/4 + 1)
>
> The idea is to assign each active core a unique handler (active_cores)
> that will not be shared with any other core.  The other handlers,
> min(active_cores/4 + 1, inactive_cores/4 + 1), will be distributed
> among the inactive cores in a round-robin fashion.  Actually, how we
> distribute handlers for inactive cores is really up to you.  We can
> optimize for whatever you think would be best.  The key takeaway is
> that active cores will always get unique handlers that are not shared
> with any other core.  And because we are adding only a small number of
> additional handlers for the inactive cores (mathematically speaking we
> would always only be adding less than 1/8 of the total cores), we
> should not see any additional overhead.

This schema will give a pretty bad distribution for "inactive" cores,
since you're allocating only a few threads for them.  It's way easier
to deliberately create a worst case scenario here.
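Plugging the 16-core example into the proposed formula makes this
concrete.  The arithmetic below is purely illustrative; the variable
names are made up and this is not actual OVS code:

#include <stdio.h>

int
main(void)
{
    /* 16-core example from above: only c0, c5, c10 and c15 are active. */
    int total_cores = 16;
    int active_cores = 4;
    int inactive_cores = total_cores - active_cores;   /* 12 */

    /* Proposed handler count: one dedicated handler per active core
     * plus a small pool shared by all inactive cores. */
    int a = active_cores / 4 + 1;                       /* 2 */
    int b = inactive_cores / 4 + 1;                     /* 4 */
    int shared = a < b ? a : b;                         /* min() -> 2 */
    int handlers_n = active_cores + shared;             /* 6 */

    printf("%d handlers: %d dedicated to active cores, "
           "%d shared by %d inactive cores\n",
           handlers_n, active_cores, shared, inactive_cores);
    return 0;
}

So twelve cores that the kernel can still receive packets on would
share just two upcall sockets.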
It could be that we are also trying to solve the wrong problem
altogether.  We can't solve the distribution problem in userspace,
because we don't know on which cores in the kernel packets will
actually arrive.  We're playing blind here.  IRQs can also be
dynamically moved around between cores.  We can try to guess a good
enough distribution, and that is what we're trying to do, but unless
we're creating a thread per core, there will always be a worst case
scenario.

To actually solve the problem we'll need to create a dynamic mapping
inside the kernel, so the core can decide which socket to use the
moment it needs to send a packet to userspace, based on some load
calculation.  Of course, the core should stick to a particular socket
once chosen, while there is some traffic on that core, to avoid packet
re-ordering.  Something similar to how we choose the tx queue id for
PMD threads in the userspace datapath (see
lib/dpif-netdev.c:dpif_netdev_xps_get_tx_qid).  This can be a solution
if we don't want to create a thread per core, and we probably don't.
Still, nobody has actually tested the scheduling overhead.

If we don't want to solve the problem, but create a good enough
solution in userspace instead, I think we should not make any
assumptions about what is likely and what is unlikely.  All the setups
are different; some are more different than others.  We should try to
make a good distribution that suits as many cases as possible, while
still being decent for less common configurations.  Especially because
it's actually hard to say which configuration is the most common.  We
just don't have that data.

Best regards, Ilya Maximets.
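As an illustration of the per-core dynamic selection idea above, here
is a rough sketch loosely modeled on the dpif_netdev_xps_get_tx_qid()
approach.  The structures, the load metric, and the idle timeout are
hypothetical; this is not actual Open vSwitch or Linux kernel code:

#include <stdint.h>

#define UPCALL_SOCK_IDLE_MS 500         /* Hypothetical re-pick interval. */

struct upcall_socket {
    int portid;                         /* Socket to deliver upcalls to.  */
    uint64_t queued;                    /* Hypothetical load metric.      */
};

struct core_upcall_state {
    struct upcall_socket *cur;          /* Socket this core sticks to.    */
    uint64_t last_used_ms;              /* Last upcall sent by this core. */
};

/* Pick the least-loaded socket for an upcall from one core, but keep
 * using the previously chosen socket while the core has had recent
 * traffic, so packets from that core are not re-ordered. */
struct upcall_socket *
core_pick_upcall_socket(struct core_upcall_state *st,
                        struct upcall_socket socks[], int n_socks,
                        uint64_t now_ms)
{
    if (st->cur && now_ms - st->last_used_ms < UPCALL_SOCK_IDLE_MS) {
        st->last_used_ms = now_ms;
        return st->cur;                 /* Stick with the current choice. */
    }

    struct upcall_socket *best = &socks[0];
    for (int i = 1; i < n_socks; i++) {
        if (socks[i].queued < best->queued) {
            best = &socks[i];
        }
    }

    st->cur = best;
    st->last_used_ms = now_ms;
    return best;
}

The key property mirrors the userspace datapath: a core re-picks a
socket only after it has been idle for a while, so packets from a busy
core keep their ordering.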