On Fri, Mar 08, 2024 at 03:11:38PM +0000, Luis Machado wrote:
> On 2/28/24 16:10, Tobias Huschle wrote:
> >
> > Questions:
> > 1. The kworker getting its negative lag occurs in the following scenario
> >    - kworker and a cgroup are supposed to execute on the same CPU
> >    - one task within the cgroup is executing and wakes up the kworker
> >    - kworker with 0 lag, gets picked immediately and finishes its
> >      execution within ~5000ns
> >    - on dequeue, kworker gets assigned a negative lag
> >    Is this expected behavior? With this short execution time, I would
> >    expect the kworker to be fine.
>
> That strikes me as a bit odd as well. Have you been able to determine
> how a negative lag is assigned to the kworker after such a short
> runtime?
I did some more trace reading though and found something.

What I observed when everything runs regularly:
- vhost and kworker run alternating on the same CPU
- if the kworker is done, it leaves the runqueue
- vhost wakes up the kworker if it needs it
--> this means:
- vhost starts alone on an otherwise empty runqueue
- it seems like it never gets dequeued (unless another unrelated task
  joins or migration hits)
- if vhost wakes up the kworker, the kworker gets selected
- vhost runtime > kworker runtime
--> kworker gets positive lag and gets selected immediately next time

What happens when it does go wrong:

From what I gather, there seem to be occasions where the vhost either
executes surprisingly quickly, or the kworker surprisingly slowly. If
these outliers reach critical values, it can happen that

    vhost runtime < kworker runtime

which now causes the kworker to get the negative lag. In this case it
seems that the vhost is very fast in waking up the kworker, and,
coincidentally, the kworker takes more time than usual to finish. We
are talking about 4-digit to low 5-digit nanoseconds.

So, for these outliers, the scheduler extrapolates that the kworker
out-consumes the vhost and should be slowed down, although in the
majority of other cases this does not happen.

Therefore this particular usecase would profit from being able to
ignore such outliers, or from being able to ignore a certain amount of
difference in the lag values, i.e. introduce some grace value around
the average runtime for which lag is not accounted. But I'm not sure I
like that idea.

So the negative lag can be somewhat justified, but for this particular
case it leads to a problem where one outlier can cause havoc.

As mentioned in the vhost discussion, it could also be argued that
vhost should not rely on the kworker always getting scheduled on wake
up, since these timing issues can always happen.
Hence, the two options:
- offer an alternative strategy which dismisses lag on wake up, for
  workloads where we know that a task usually finishes faster than
  others but should not be punished by rare outliers (whether that is
  predictable, I don't know)
- require vhost to address this issue on their side (if possible
  without creating an armada of side effects)
(plus the third one mentioned above, but that requires a magic cutoff
value, meh)

> I was looking at a different thread
> (https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.c...@intel.com/)
> that uncovers a potential overflow in the eligibility calculation.
> Though I don't think that is the case for this particular vhost
> problem.

Yea, the numbers I see do not look very overflowy.