On 19 Feb 2024, at 20:23, Ilya Maximets wrote:
> On 2/8/24 16:53, Eelco Chaudron wrote:
>>
>>
>> On 8 Feb 2024, at 15:00, Ilya Maximets wrote:
>>
>>> On 2/8/24 13:44, Eelco Chaudron wrote:
>>>>
>>>>
>>>> On 6 Feb 2024, at 16:01, Ilya Maximets wrote:
>>>>
>>>>> On 2/6/24 15:07, Eelco Chaudron wrote:
>>>>>>
>>>>>>
>>>>>> On 6 Feb 2024, at 14:46, Ilya Maximets wrote:
>>>>>>
>>>>>>> On 2/6/24 14:30, Eelco Chaudron wrote:
>>>>>>>> Previously, the ability to override the system default for the number
>>>>>>>> of handler threads was broken since the introduction of the per-CPU
>>>>>>>> upcall handlers.
>>>>>>>
>>>>>>> It wasn't broken, it was intentionally disabled. I don't think we should
>>>>>>> introduce the ability to configure the number of handlers for per-CPU
>>>>>>> dispatch mode. It only allows users to shoot themselves in the foot if
>>>>>>> they do not understand what they are doing.
>>>>>>
>>>>>> I think the design was not thought through enough, and resulted in a lot
>>>>>> of corner-case issues. From the start, we should still have allowed
>>>>>> tuning of this option as it was before.
>>>>>
>>>>> The tuning is much more complicated and has very different consequences
>>>>> with per-CPU dispatch compared to per-vport. So, the results of adjusting
>>>>> this option can be very different.
>>>>
>>>>>>> Many integration scripts in the wild are still blindly setting these
>>>>>>> options without even knowing that the underlying implementation changed
>>>>>>> completely. And those setups may suffer if we suddenly decide to respect
>>>>>>> the configuration that is ignored on modern kernels for all currently
>>>>>>> supported versions of OVS.
>>>>>>
>>>>>> I feel like this is not OVS's problem. If people do not know what they
>>>>>> are doing, they should not touch it… We have a nice Dutch saying for
>>>>>> this: "Wie zijn billen brandt, moet op de blaren zitten" ("He who burns
>>>>>> his buttocks must sit on the blisters") :)
>>>>>
>>>>> It is an OVS problem for the same reason as above.
>>>>> We changed the logic under the hood so much that users who expected a
>>>>> certain behavior in the past may get a very different behavior now. We
>>>>> can't just change the meaning of the knob that easily. See below.
>>>>
>>>> Ack, so I would suggest adding a new knob.
>>>>
>>>>>>
>>>>>> However, I do feel like we should have an option to configure this, as
>>>>>> it makes no sense on a 256-core system to create 256 handler threads.
>>>>>
>>>>> It actually makes perfect sense. Since the traffic can appear on 256
>>>>> cores, we should be able to handle it on each one of them in order to
>>>>> have a balanced load.
>>>>>
>>>>> With per-vport dispatch, on a system with 10 cores and 5 handler threads,
>>>>> all 5 handler threads will share the load, because each one of them
>>>>> listens on all the ports, hence each one of them can process packets
>>>>> whenever it has time to do so.
>>>>
>>>> Yes, but this had other problems :)
>>>
>>> Sure, that's why we went away from that logic. :)
>>>
>>>>> With per-CPU dispatch, on the same system these 5 threads are mapped to
>>>>> specific cores, 2 cores per thread. If all the traffic comes in on 4
>>>>> cores, we may have 2 threads overloaded and 3 other threads doing
>>>>> nothing. In the same scenario with per-vport dispatch we would have all
>>>>> 5 threads sharing the load.
>>>>
>>>> Yes, but with 256 we might have 250 doing nothing, and looking at some
>>>> data from a system with decent OpenFlow tables (1M+ entries, but only a
>>>> couple of tables), we can handle about 232k upcalls per second on a single
>>>> handler thread for 64-byte packets before we run out of socket buffer.
>>>> For 512-byte packets it is around 212k, and for 1500-byte packets around
>>>> 189k.
>>>
>>> Having idle threads in a system is generally not a problem. As long as we
>>> don't have hundreds of thousands of them, they will not impact scheduling
>>> performance.
>>>
>>> Also, the actual performance of handlers is hard to quantify, since OVS
>>> can be running on a large variety of different hardware. Yours is likely
>>> on the powerful side of the spectrum.
>>>
>>>> If we have more than 200k upcalls per second, we have other problems,
>>>> especially as the default datapath flow limit is 200k.
>>>>
>>>> So I guess the new knob becomes a tunable for resource consumption vs.
>>>> upcall handling, which I think is fine, and might be desirable in certain
>>>> configurations. We have other knobs in the system that require a decent
>>>> amount of knowledge.
>>>
>>> Having more knobs just for the sake of having something configurable
>>> doesn't sound good to me. We should try not to introduce ways to
>>> configure something that doesn't need to be configured.
>>>
>>> Sure, creating spare threads consumes memory. However, there are a few
>>> things to consider here:
>>>
>>> 1. A system with 256 cores likely has 512 MB of RAM to spare
>>>    (by default we limit to 2 MB of stack per thread).
>>>    I haven't seen any complaints from users about memory consumption
>>>    on large systems.
>>>
>>> 2. By reducing the number of threads we're working around the problem
>>>    of large stack use during flow translation. If we can reduce the
>>>    stack usage, we can reduce the overall memory consumption.
>>>    So, maybe that should be addressed instead of working around it with
>>>    a hard-to-use-right knob.
>>
>> I guess the problem is that various people have tried, and the stack size
>> did get reduced, but not enough. The code is very complex, and due to all
>> the corner cases introduced over the years, it's (almost) impossible to
>> get the recursion removed without a complete rewrite (while trying not to
>> break all these corner cases).
>
> I don't remember anyone actually trying to reduce the recursion depth.
> But there was some work reducing the stack allocations, true.
>
> It's not necessary to completely remove recursion, but it should be
> possible to short-circuit it in most common cases.

Mike did some stack reducing, but Aaron has a task to limit the recursion.
I thought he hit a dead end, but maybe he can comment, or you can share your
ideas with him.

>> And in the past, we did receive complaints about the memory size. In some
>> cases, we still need 8 MB, which is 2 GB on a 256-core system.
>>
>> Unless someone finds a solution to this, I still feel we should have this
>> option in case we need it.
>
> CPU affinity is an available option, as discussed below.
>
> Another option we can explore is to use MCL_ONFAULT instead of locking
> all the memory right away. MCL_ONFAULT didn't exist when --mlockall
> was introduced, so it may be reasonable to use it. Some testing on
> systems under memory pressure is needed, of course. But if the main
> concern is that OVS allocates too much memory that it doesn't use, it
> will no longer be the case.

I'll do some experimenting and try to find the use case that would break on a
large machine. To be continued… In the meantime, I'll re-send a v2 containing
only the first patch.

//Eelco

>>> 3. We can probably limit the number of revalidators instead (without
>>>    a knob). We may estimate the number of revalidators needed from the
>>>    datapath flow hard limit and not create more than that even if we
>>>    have more cores. That should help with memory consumption as well.
>>>
>>>>> So, in order for the user to make a decision on the number of threads,
>>>>> they have to know on which CPUs they are most likely to have interrupts
>>>>> and how exactly OVS will map handlers to CPU numbers. That is too much
>>>>> for a user to know, IMO. Some of that is actually impossible to know.
>>>>> And when the traffic pattern changes, the number of threads may need to
>>>>> be changed to game the mapping to avoid overload.
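[Editorial sketch] The mapping problem described above can be illustrated with
a toy model. This uses a plain modulo assignment of cores to handler threads
purely for illustration; the actual per-CPU dispatch logic in OVS is not
reproduced here and may distribute CPUs differently:

```python
from collections import Counter

def map_cores_to_handlers(n_cores, n_handlers):
    """Toy model: assign each CPU core to a handler thread round-robin."""
    return {core: core % n_handlers for core in range(n_cores)}

# 10 cores, 5 handler threads -> each handler owns exactly 2 cores.
mapping = map_cores_to_handlers(10, 5)
per_thread = Counter(mapping.values())

# If all traffic arrives on cores 0, 1, 5, and 6, only handlers 0 and 1
# see any upcalls; the other three threads sit idle and, unlike per-vport
# dispatch, cannot pick up the excess load.
busy_handlers = {mapping[c] for c in (0, 1, 5, 6)}
print(per_thread)      # 2 cores per handler
print(busy_handlers)   # only handlers 0 and 1 are loaded
```

This is why the thread count alone is not enough to reason about balance: the
user would also need to know the core-to-handler mapping and where interrupts
land.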
>>>>>
>>>>> Also, multiply that complexity by the CPU affinity of the ovs-vswitchd
>>>>> that changes the mapping.
>>>>
>>>> Well, using an affinity mask will mess up the above anyway, as we will
>>>> get fewer threads, sharing potential CPU resources, etc. So making
>>>> n_threads configurable will have the same problem as changing the
>>>> affinity mask. But we might have other benefits (see below).
>>>>
>>>>> AFAIU, the only way to achieve an even load without dynamic
>>>>> reconfiguration is to have a thread per CPU core.
>>>>
>>>> Yes, only if we have no affinity mask, but at the cost of additional
>>>> resources (which we might not want).
>>>>
>>>>>> Maybe we should add a new option, n-forced-handler-threads. This will
>>>>>> avoid the above issue with existing scripts.
>>>>>
>>>>> Users can limit the number of threads by setting CPU affinity for
>>>>> ovs-vswitchd. If these users are knowledgeable enough to predict traffic
>>>>> patterns and CPU mappings, they can set appropriate CPU affinity masks
>>>>> as well.
>>>>
>>>> Well, you cannot exactly set the number of handler threads, as there will
>>>> be a round-up to a prime (which should not be a big problem). But the
>>>> main problem I see is that now you limit all of OVS to that set of CPUs.
>>>> So a burst of upcalls could mess with your revalidator handling (or other
>>>> threads running on that set).
>>>
>>> Should not be a problem. We always run handlers and revalidators on the
>>> same cores by default. And revalidators do not need millisecond precision.
>>> Also, revalidators can float between cores. So, the use of the same cores
>>> should not be a problem as long as not all handlers are at 100% load,
>>> which would be a problem of its own regardless.
>>>
>>>>> For others, changing the meaning of a knob underneath them would be kind
>>>>> of irresponsible from our side.
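[Editorial sketch] The "round-up to a prime" remark above can be illustrated
with a small helper. This is illustration only; the exact thread-sizing logic
lives in OVS itself and is not reproduced here:

```python
def next_prime(n):
    """Smallest prime >= n (hypothetical helper for illustration)."""
    def is_prime(k):
        return k >= 2 and all(k % d for d in range(2, int(k ** 0.5) + 1))
    while not is_prime(n):
        n += 1
    return n

# Requesting 4 handler threads would be rounded up to 5, and 8 to 11,
# so a configured thread count is only honored approximately.
print(next_prime(4), next_prime(8))
```

As Eelco notes, the rounding itself is not a big problem; the bigger issue is
that an affinity mask constrains every OVS thread, not just the handlers.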
>>>>
>>>> Ack, I will introduce a new knob (also some changes to dpif-netlink, as
>>>> the patch has a bug).
>>>>
>>>
>>> I'm still not convinced we need a knob for the reasons listed above.
>>>
>>> Best regards, Ilya Maximets.
>> _______________________________________________
>> dev mailing list
>> d...@openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev