On 19 Feb 2024, at 20:23, Ilya Maximets wrote:

> On 2/8/24 16:53, Eelco Chaudron wrote:
>>
>>
>> On 8 Feb 2024, at 15:00, Ilya Maximets wrote:
>>
>>> On 2/8/24 13:44, Eelco Chaudron wrote:
>>>>
>>>>
>>>> On 6 Feb 2024, at 16:01, Ilya Maximets wrote:
>>>>
>>>>> On 2/6/24 15:07, Eelco Chaudron wrote:
>>>>>>
>>>>>>
>>>>>> On 6 Feb 2024, at 14:46, Ilya Maximets wrote:
>>>>>>
>>>>>>> On 2/6/24 14:30, Eelco Chaudron wrote:
>>>>>>>> Previously, the ability to override the system default for the number
>>>>>>>> of handler threads was broken since the introduction of the per-CPU
>>>>>>>> upcall handlers.
>>>>>>>
>>>>>>> It wasn't broken, it was intentionally disabled.  I don't think we
>>>>>>> should introduce the ability to configure the number of handlers for
>>>>>>> per-CPU dispatch mode.  It only allows users to shoot themselves in the
>>>>>>> foot if they do not understand what they are doing.
>>>>>>
>>>>>> I think the design was not thought through enough, and it resulted in a
>>>>>> lot of corner-case issues. From the start, we should still have allowed
>>>>>> tuning of this option as it was before.
>>>>>
>>>>> The tuning is much more complicated and has very different consequences
>>>>> with per-cpu dispatch compared to per-vport.  So, the results of adjusting
>>>>> this option can be very different.
>>>>
>>>>>>> Many integration scripts in the wild are still blindly setting these
>>>>>>> options without even knowing that the underlying implementation changed
>>>>>>> completely.  And those setups may suffer if we suddenly decide to
>>>>>>> respect the configuration that is ignored on modern kernels by all
>>>>>>> currently supported versions of OVS.
>>>>>>
>>>>>> I feel like this is not OVS's problem. If people do not know what they
>>>>>> are doing, they should not touch it… We have a nice Dutch saying for
>>>>>> this: "Wie zijn billen brandt, moet op de blaren zitten" ("he who burns
>>>>>> his behind must sit on the blisters") :)
>>>>>
>>>>> It is OVS's problem for the same reason as above.  We changed the logic
>>>>> under the hood so much that users who expected a certain behavior in the
>>>>> past may see very different behavior now.  We can't just change the
>>>>> meaning of the knob that easily.  See below.
>>>>
>>>> Ack, so I would suggest adding a new knob.
>>>>
>>>>>>
>>>>>> However, I do feel like we should have an option to configure this, as
>>>>>> it makes no sense on a 256-core system to create 256 handler threads.
>>>>>
>>>>> It actually makes perfect sense.  Since the traffic can appear on 256
>>>>> cores, we should be able to handle it on each one of them in order to
>>>>> have balanced load.
>>>>>
>>>>> With per-vport dispatch, on a system with 10 cores and 5 handler threads,
>>>>> all 5 handler threads will share the load, because each one of them
>>>>> listens on all the ports, hence each one of them can process packets
>>>>> whenever it has time to do so.
>>>>
>>>> Yes, but this had other problems :)
>>>
>>> Sure, that's why we went away from that logic. :)
>>>
>>>>
>>>>> With per-CPU dispatch, on the same system these 5 threads are mapped to
>>>>> specific cores, 2 cores per thread.  If all the traffic comes in on 4
>>>>> cores, we may have 2 threads overloaded and 3 other threads doing
>>>>> nothing.  In the same scenario with per-vport dispatch we would have all
>>>>> 5 threads sharing the load.
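
(For illustration, a rough stand-alone sketch of that static assignment,
assuming a simple modulo mapping of CPU id to handler thread; the exact
kernel/OVS assignment may differ in detail:)

#include <stdio.h>

/* Toy model of per-CPU upcall dispatch: every CPU id is statically mapped
 * to one handler thread.  With fewer handlers than cores the mapping wraps
 * around, so traffic concentrated on a few cores only loads the handlers
 * those cores happen to map to. */
static int
handler_for_cpu(int cpu, int n_handlers)
{
    return cpu % n_handlers;    /* assumed modulo assignment */
}

int
main(void)
{
    const int n_cores = 10, n_handlers = 5;

    for (int cpu = 0; cpu < n_cores; cpu++) {
        printf("cpu %2d -> handler %d\n", cpu, handler_for_cpu(cpu, n_handlers));
    }
    /* If interrupts only land on, say, CPUs 0, 5, 1 and 6, then handlers 0
     * and 1 take all the load while handlers 2-4 stay idle. */
    return 0;
}

The point being that the static mapping, not the raw thread count, decides
which handlers see the load; with per-vport dispatch any handler with spare
time can pick up the work.
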
>>>>
>>>> Yes, but with 256 we might have 250 doing nothing. Looking at some data
>>>> from a system with decent OpenFlow tables (1M+ entries, but only a couple
>>>> of tables), we can handle about 232k upcalls per second on a single
>>>> handler thread for 64-byte packets before we run out of socket buffer.
>>>> For 512-byte packets it is around 212k, and for 1500-byte packets around
>>>> 189k.
>>>
>>> Having idle threads in a system is generally not a problem.  As long as we
>>> don't have hundreds of thousands of them, they will not impact scheduling
>>> performance.
>>>
>>> Also, actual performance of handlers is hard to quantify, since OVS can be
>>> running on a large variety of different hardware.  Yours is likely on the
>>> powerful side of the spectrum.
>>>
>>>>
>>>> If we have more than 200k upcalls per second, we have other problems,
>>>> especially as the default dp flow size is 200k.
>>>>
>>>> So I guess the new knob becomes a tunable for trading resource consumption
>>>> against upcall handling capacity, which I think is fine, and might be
>>>> desirable in certain configurations.  We have other knobs in the system
>>>> that require a decent amount of knowledge.
>>>
>>> Having more knobs just for the sake of having something configurable
>>> doesn't sound good to me.  We should try to not introduce ways to
>>> configure something that doesn't need to be configured.
>>>
>>> Sure, creating spare threads consumes memory.  However, there are a few
>>> things to consider here:
>>>
>>> 1. A system with 256 cores likely has 512 MB of RAM to spare
>>>    (by default we limit each thread to 2 MB of stack, so 256 threads
>>>    need at most 512 MB for stacks).
>>>    I haven't seen any complaints from users about memory consumption
>>>    on large systems.
>>>
>>> 2. By reducing the number of threads we're working around the problem
>>>    of large stack usage during flow translation.  If we can reduce the
>>>    stack usage, we can reduce the overall memory consumption.
>>>    So, maybe that should be addressed instead of working around it with
>>>    a hard-to-use-right knob.
>>
>> I guess the problem is that various people have tried, and the stack size
>> did get reduced, but not enough. The code is very complex, and due to all
>> the corner cases introduced over the years, it's (almost) impossible to
>> remove the recursion without a complete rewrite (while trying not to break
>> all those corner cases).
>
> I don't remember anyone actually trying to reduce the recursion depth.
> But there was some work on reducing the stack allocations, true.
>
> It's not necessary to completely remove recursion, but it should be
> possible to short-circuit it in most common cases.

Mike did some stack reduction work, but Aaron has a task to limit the
recursion. I thought he hit a dead end, but maybe he can comment, or you can
share your ideas with him.
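
For what it's worth, below is a minimal stand-alone sketch of the
"short-circuit the recursion in the common case" idea; none of this is the
actual xlate code, it only shows the shape of the transformation: simple,
linear resubmit chains are walked in a loop, and only the rare complex case
still recurses.

#include <stdio.h>

/* Toy pipeline: table i "resubmits" to table i + 1, and only the last table
 * is treated as a complex case.  Purely illustrative, not OVS code. */
#define N_TABLES 8

static int
is_simple_resubmit(int table)
{
    return table < N_TABLES - 1;
}

/* Rare, complicated cases keep the existing recursive handling. */
static void
xlate_complex(int table, int depth)
{
    printf("table %d handled by the complex path at depth %d\n", table, depth);
}

/* Walk plain resubmit chains iteratively so the common case uses constant
 * stack, and fall back to the complex path only when needed. */
static void
xlate(int table, int depth)
{
    while (is_simple_resubmit(table)) {
        printf("table %d handled iteratively at depth %d\n", table, depth);
        table++;            /* follow the resubmit without a recursive call */
    }
    xlate_complex(table, depth + 1);
}

int
main(void)
{
    xlate(0, 0);
    return 0;
}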

>>
>> And in the past, we did receive complaints about the memory size. In some
>> cases, we still need 8 MB of stack per thread, which is 2 GB on a 256-core
>> system.
>>
>> Unless someone finds a solution to this, I still feel we should have this
>> option in case we need it.
>
> CPU affinity is an available option, as discussed below.
>
> Another option we can explore is to use MCL_ONFAULT instead of locking
> all the memory right away.  MCL_ONFAULT didn't exist when --mlockall
> was introduced, so it may be reasonable to use it.  Some testing on
> systems under memory pressure is needed, of course.  But if the main
> concern is that OVS allocates too much memory that it doesn't use, that
> will no longer be the case.
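
For reference, a minimal sketch of what that could look like; MCL_ONFAULT
(Linux 4.4+) makes mlockall() lock pages only as they are first touched, so
large but mostly unused thread stacks are no longer forced into resident
memory up front.  This is just a stand-alone test program, not the actual
ovs-vswitchd code:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int
main(void)
{
    /* Lock current and future mappings, but only as pages are faulted in. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT) == 0) {
        printf("locked with MCL_ONFAULT\n");
        return 0;
    }

    /* Older kernels (< 4.4) reject MCL_ONFAULT with EINVAL; fall back to
     * the traditional behavior of pre-faulting and locking everything. */
    if (errno == EINVAL && mlockall(MCL_CURRENT | MCL_FUTURE) == 0) {
        printf("locked without MCL_ONFAULT\n");
        return 0;
    }

    fprintf(stderr, "mlockall: %s\n", strerror(errno));
    return 1;
}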

I’ll do some experimenting, and try to find the use case that would break on a 
large machine. To be continued…

In the meantime, I’ll re-send a v2 only containing the first patch.

//Eelco

>>> 3. We can probably limit the number of revalidators instead (without
>>>    a knob).  We may estimate the number of revalidators needed by the
>>>    datapath flow hard limit and not create more than that even if we
>>>    have more cores.  Should help with memory consumption as well.
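
Something along the lines of the sketch below, where the per-revalidator
flow capacity is a made-up constant and the core-based default is only an
approximation of what OVS does today:

#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Assumed number of datapath flows a single revalidator can keep up with
 * per dump round; a made-up constant for illustration only. */
#define FLOWS_PER_REVALIDATOR 25000

/* Cap the number of revalidators by what the flow hard limit can actually
 * keep busy, instead of scaling purely with the number of cores. */
static unsigned int
estimate_n_revalidators(unsigned int n_cores, unsigned int flow_limit)
{
    unsigned int by_cores = n_cores / 4 + 1;   /* approximate current default */
    unsigned int by_flows = MAX(1u, flow_limit / FLOWS_PER_REVALIDATOR);

    return MIN(by_cores, by_flows);
}

int
main(void)
{
    /* 256 cores with the default 200k flow limit: the flow limit, not the
     * core count, ends up bounding the thread count (8 instead of 65). */
    printf("%u revalidators\n", estimate_n_revalidators(256, 200000));
    return 0;
}
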
>>>
>>>>
>>>>> So, in order for the user to make a decision on the number of threads,
>>>>> they have to know on which CPUs they are most likely to have interrupts
>>>>> and how exactly OVS will map handlers to CPU numbers.  That is too much
>>>>> for a user to know, IMO.  Some of that is actually impossible to know.
>>>>> And when the traffic pattern changes, the number of threads may need to
>>>>> be changed to game the mapping and avoid overload.
>>>>>
>>>>> Also, multiply that complexity by the CPU affinity of the ovs-vswitchd
>>>>> process, which changes the mapping.
>>>>
>>>> Well, using an affinity mask will mess up the above anyway, as we will
>>>> get fewer threads, sharing potential CPU resources, etc. So making
>>>> n_threads configurable will have the same problem as changing the
>>>> affinity mask. But we might have other benefits (see below).
>>>>
>>>>> AFAIU, the only way to achieve an even load without dynamic
>>>>> reconfiguration is to have a thread per CPU core.
>>>>
>>>> Yes, only if we have no affinity mask, but at the cost of additional
>>>> resources (which we might not want).
>>>>
>>>>>> Maybe we should add a new option, n-forced-handler-threads. This will
>>>>>> avoid the above issue with existing scripts.
>>>>>
>>>>> Users can limit the number of threads by setting CPU affinity for
>>>>> ovs-vswitchd.  If these users are knowledgeable enough to predict
>>>>> traffic patterns and CPU mappings, they can set appropriate CPU affinity
>>>>> masks as well.
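
For completeness, this works because the number of cores the process may run
on can be read back from its affinity mask, and that bounds how many handler
threads are worth creating.  A minimal stand-alone example (not the actual
OVS code) of deriving that count:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int
main(void)
{
    cpu_set_t set;

    /* Read the affinity mask of the current process (pid 0 == self).
     * Starting the daemon under e.g. taskset shrinks this mask, which in
     * turn caps how many handler threads are worth creating. */
    if (sched_getaffinity(0, sizeof set, &set) != 0) {
        perror("sched_getaffinity");
        return 1;
    }

    printf("allowed to run on %d cores\n", CPU_COUNT(&set));
    return 0;
}
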
>>>>
>>>> Well, you cannot set the exact number of handler threads, as there will
>>>> be a round-up to a prime (which should not be a big problem). But the
>>>> main problem I see is that you now limit all of OVS to that set of CPUs.
>>>> So a burst of upcalls could mess with your revalidator handling (or
>>>> other threads running on that set).
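
On the round-up, a sketch of a generic "next prime" round-up, just to
illustrate why the requested and effective thread counts can differ; whether
this matches the exact rounding ovs-vswitchd applies is not something this
sketch claims:

#include <stdbool.h>
#include <stdio.h>

static bool
is_prime(unsigned int n)
{
    if (n < 2) {
        return false;
    }
    for (unsigned int d = 2; d * d <= n; d++) {
        if (n % d == 0) {
            return false;
        }
    }
    return true;
}

/* Round a requested thread count up to the next prime number. */
static unsigned int
round_up_to_prime(unsigned int n)
{
    while (!is_prime(n)) {
        n++;
    }
    return n;
}

int
main(void)
{
    /* E.g. a request for 8 threads rounds up to 11, and 4 rounds up to 5. */
    printf("8 -> %u\n", round_up_to_prime(8));
    printf("4 -> %u\n", round_up_to_prime(4));
    return 0;
}
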
>>>
>>> Should not be a problem.  We always run handlers and revalidators on the
>>> same cores by default.  And revalidators do not need millisecond
>>> precision.  Also revalidators can float between cores.  So, the use of
>>> the same cores should not be a problem as long as not all handlers are at
>>> 100% load, which will be a problem on its own regardless.
>>>
>>>>
>>>>> For others, changing the meaning of a knob underneath them would be kind
>>>>> of irresponsible from our side.
>>>>
>>>> Ack, I will introduce a new knob (also some changes to dpif-netlink, as
>>>> the patch has a bug).
>>>>
>>>
>>> I'm still not convinced we need a knob for the reasons listed above.
>>>
>>> Best regards, Ilya Maximets.
>>
