Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On 01/07/2011 12:29 AM, Mike Galbraith wrote:

> +#ifdef CONFIG_SMP
> +	/*
> +	 * If this yield is important enough to want to preempt instead
> +	 * of only dropping a ->next hint, we're alone, and the target
> +	 * is not alone, pull the target to this cpu.
> +	 *
> +	 * NOTE: the target may be alone in its cfs_rq if another class
> +	 * task or another task group is currently executing on its cpu.
> +	 * In this case, we still pull, to accelerate it toward the cpu.
> +	 */
> +	if (cfs_rq != p_cfs_rq && preempt && cfs_rq->nr_running == 1 &&
> +	    cpumask_test_cpu(this_cpu, &p->cpus_allowed)) {
> +		pull_task(task_rq(p), p, this_rq(), this_cpu);
> +		p_cfs_rq = cfs_rq_of(pse);
> +	}
> +#endif

This causes some fun issues in a simple test case on my system. The test
consists of two 4-VCPU KVM guests, bound to the same 4 CPUs on the host.
One guest runs the AMQP performance test; the other guest is totally idle.

That means that besides the 4 very busy VCPUs, there are only a few
percent of CPU used by background tasks from the idle guest and the
qemu-kvm userspace bits. However, the busy guest ends up restricted to
just 3 of the 4 CPUs, leaving one idle!

A simple explanation is that the pulling code above pulls another VCPU
onto the local CPU whenever we run into contention inside the guest and
some random background task is running on the CPU where that other VCPU
was. From that point on, the 4 VCPUs stay on 3 CPUs, leaving one idle.
Any time we hit contention inside the guest (which is pretty frequent),
we move whichever VCPU is not currently running to another CPU.

Cgroups only make matters worse - libvirt places each KVM guest into its
own cgroup, so a VCPU will generally always be alone on its own
per-cgroup, per-cpu runqueue! That can lead to pulling a VCPU onto our
local CPU because we think we are alone, when in reality we share the
CPU with others...

Removing the pulling code allows me to use all 4 CPUs with a 4-VCPU KVM
guest in an uncontended situation.

> +	/* Tell the scheduler that we'd really like pse to run next. */
> +	p_cfs_rq->next = pse;

Using set_next_buddy propagates this up to the root, allowing the
scheduler to actually know who we want to run next when cgroups is
involved.

> +	/* We know whether we want to preempt or not, but are we allowed? */
> +	if (preempt && same_thread_group(p, task_of(p_cfs_rq->curr)))
> +		resched_task(task_of(p_cfs_rq->curr));

With this in place, we can get into the situation where we will gladly
give up CPU time, but not actually give any to the other VCPUs in our
guest. I believe we can get rid of that test, because pick_next_entity
already makes sure it ignores ->next if picking ->next would lead to
unfairness.

Removing this test (and simplifying yield_to_task_fair) seems to lead to
more predictable test results.

I'll send the updated patch in another email, since this one is already
way too long for a changelog :)

-- 
All rights reversed

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
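[Editor's illustration] Rik's point about set_next_buddy and cgroup
hierarchies can be sketched with a small user-space toy model. All types
and names below are illustrative inventions, not scheduler code: a
->next hint left only on the leaf runqueue is invisible to the pick at
the parent level, while propagating it up the hierarchy makes the hint
effective.

```c
#include <stddef.h>

/* Toy model of hierarchical entity picking; names are ours. */
struct se;
struct cfs_rq {
	struct se *next;	/* buddy hint */
	struct se *first;	/* default pick (e.g. lowest vruntime) */
};
struct se {
	const char *name;
	struct cfs_rq *my_q;	/* non-NULL for a group entity */
	struct cfs_rq *on_q;	/* runqueue this entity is queued on */
	struct se *parent;	/* group entity one level up, if any */
};

static struct se *pick(struct cfs_rq *q)
{
	/* the real pick_next_entity also rejects an unfair hint;
	 * this toy just honours it when set */
	return q->next ? q->next : q->first;
}

static struct se *pick_task(struct cfs_rq *root)
{
	struct se *se = pick(root);
	while (se->my_q)	/* descend through group entities */
		se = pick(se->my_q);
	return se;
}

/* analogue of set_next_buddy(): push the hint up to the root */
static void set_next_buddy(struct se *se)
{
	for (; se; se = se->parent)
		se->on_q->next = se;
}
```

With two guest cgroups queued on the root, setting the hint only on
guest B's leaf cfs_rq does nothing when the root-level pick chooses
guest A's group entity first; after set_next_buddy() the root picks B's
group entity and the leaf pick finds the target.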
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Wed, 2011-01-12 at 22:02 -0500, Rik van Riel wrote:

> Cgroups only make matters worse - libvirt places each KVM guest into
> its own cgroup, so a VCPU will generally always be alone on its own
> per-cgroup, per-cpu runqueue! That can lead to pulling a VCPU onto our
> local CPU because we think we are alone, when in reality we share the
> CPU with others...

How can that happen? If the task you're trying to accelerate isn't in
your task group, the whole attempt should be a noop.

> Removing the pulling code allows me to use all 4 CPUs with a 4-VCPU
> KVM guest in an uncontended situation.
>
> > +	/* Tell the scheduler that we'd really like pse to run next. */
> > +	p_cfs_rq->next = pse;
>
> Using set_next_buddy propagates this up to the root, allowing the
> scheduler to actually know who we want to run next when cgroups is
> involved.
>
> > +	/* We know whether we want to preempt or not, but are we allowed? */
> > +	if (preempt && same_thread_group(p, task_of(p_cfs_rq->curr)))
> > +		resched_task(task_of(p_cfs_rq->curr));
>
> With this in place, we can get into the situation where we will gladly
> give up CPU time, but not actually give any to the other VCPUs in our
> guest. I believe we can get rid of that test, because pick_next_entity
> already makes sure it ignores ->next if picking ->next would lead to
> unfairness.

Preempting everybody who is in your way isn't playing nice neighbor, so
I think at least the same_thread_group() test needs to stay. But that's
Peter's call. Starting a zillion threads to play wakeup-preempt and
hog the cpu isn't nice either, but it's allowed.

> Removing this test (and simplifying yield_to_task_fair) seems to lead
> to more predictable test results.

Less is more :)

	-Mike
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On 01/12/2011 10:26 PM, Mike Galbraith wrote:

> On Wed, 2011-01-12 at 22:02 -0500, Rik van Riel wrote:
>
> > Cgroups only make matters worse - libvirt places each KVM guest into
> > its own cgroup, so a VCPU will generally always be alone on its own
> > per-cgroup, per-cpu runqueue! That can lead to pulling a VCPU onto
> > our local CPU because we think we are alone, when in reality we
> > share the CPU with others...
>
> How can that happen? If the task you're trying to accelerate isn't in
> your task group, the whole attempt should be a noop.

No no, all the VCPUs from the same guest are in the same cgroup.
However, with 4 VCPUs and 4 physical CPUs, chances are that they're all
alone on their own per-cpu, per-cgroup cfs_rq, while each CPU might
still have other runnable processes in other cfs_rq sched entities.

-- 
All rights reversed
Re: [Fwd: Re: [RFC -v3 PATCH 2/3] sched: add yield_to function]
On Wed, Jan 5, 2011 at 5:41 PM, Peter Zijlstra pet...@infradead.org wrote:

> On Wed, 2011-01-05 at 00:38 +0100, Tommaso Cucinotta wrote:
>
> > On 04/01/2011 19:15, Dario Faggioli wrote:
> >
> > >  Forwarded Message 
> > > From: Peter Zijlstra a.p.zijls...@chello.nl
> > > To: Rik van Riel r...@redhat.com
> > > Cc: Hillf Danton dhi...@gmail.com, kvm@vger.kernel.org,
> > >     linux-ker...@vger.kernel.org, Avi Kivity a...@redhat.com,
> > >     Srivatsa Vaddagiri va...@linux.vnet.ibm.com,
> > >     Mike Galbraith efa...@gmx.de, Chris Wright chr...@sous-sol.org
> > > Subject: Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
> > > Date: Tue, 04 Jan 2011 19:05:54 +0100
> > >
> > > RT guests don't make sense, there's nowhere near enough
> > > infrastructure for that to work well. I'd argue that KVM running
> > > with RT priority is a bug.
> >
> > Peter, can I ask why you state that? In the IRMOS project, we are
> > deploying KVM VMs using Fabio's real-time scheduler (for others,
> > a.k.a. Fabio's EDF throttling patch, or the IRMOS RT scheduler) in
> > order to let the VMs get precise CPU scheduling guarantees from the
> > kernel. So, in this context we do have KVM running at RT priority,
> > and we have experimental results showing how this can improve the
> > stability of performance of the hosted guest VMs. Of course, don't
> > misunderstand me: this is a necessary condition for stable
> > performance of KVM VMs, I'm not saying it is sufficient for
>
> I was mostly referring to the existing RT cruft (SCHED_RR/FIFO),
> that's utterly useless for KVM. As to hosting vcpus with CBS, this
> might maybe make sense, but RT guests are still miles away.
>
> Anyway, I'm not quite sure how you would want to deal with the guest
> spinlock issue in CBS; ideally you'd use paravirt guests to avoid that
> whole problem.
>
> Anyway, /me goes do something useful, virt sucks and should be taken
> out back and shot in the head.

I don't think we are still on track with the patch from Rik, in which
Mike brought the yield_to method into scheduling.

The focus, as I see it, is mainly on the effectiveness of the new
method, since it could also be utilized in other environments; currently
it has nothing to do with the RT cruft, but aims at easing certain lock
contention in KVM.

Another issue is whether the change in the fair scheduling class that
accompanies the new method is deserved, for whatever reason Rik holds.

Let's please return to the patch, and defer the RT discussion.

thanks
Hillf
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Thu, Jan 6, 2011 at 12:57 AM, Mike Galbraith efa...@gmx.de wrote:

> sched: Add yield_to(task, preempt) functionality.
>
> Currently only implemented for fair class tasks.
>
> Add a yield_to_task method() to the fair scheduling class, allowing
> the caller of yield_to() to accelerate another thread in its thread
> group, task group, and sched class toward either its cpu, or
> potentially the caller's own cpu if the 'preempt' argument is also
> passed.
>
> Implemented via a scheduler hint, using cfs_rq->next to encourage the
> target being selected.
>
> Signed-off-by: Rik van Riel r...@redhat.com
> Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
> Signed-off-by: Mike Galbraith efa...@gmx.de
>
> ---
>  include/linux/sched.h |  1
>  kernel/sched.c        | 56 ++++++++++++++++++++++++++
>  kernel/sched_fair.c   | 52 ++++++++++++++++++++++
>  3 files changed, 109 insertions(+)
>
> Index: linux-2.6/include/linux/sched.h
> ===
> --- linux-2.6.orig/include/linux/sched.h
> +++ linux-2.6/include/linux/sched.h
> @@ -1056,6 +1056,7 @@ struct sched_class {
>  	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
>  	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
>  	void (*yield_task) (struct rq *rq);
> +	int (*yield_to_task) (struct task_struct *p, int preempt);
>
>  	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
>
> Index: linux-2.6/kernel/sched.c
> ===
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -5327,6 +5327,62 @@ void __sched yield(void)
>  }
>  EXPORT_SYMBOL(yield);
>
> +/**
> + * yield_to - yield the current processor to another thread in
> + * your thread group, or accelerate that thread toward the
> + * processor it's on.
> + *
> + * It's the caller's job to ensure that the target task struct
> + * can't go away on us before we can do any checks.
> + */
> +void __sched yield_to(struct task_struct *p, int preempt)
> +{
> +	struct task_struct *curr = current;
> +	struct rq *rq, *p_rq;
> +	unsigned long flags;
> +	int yield = 0;
> +
> +	local_irq_save(flags);
> +	rq = this_rq();
> +
> +again:
> +	p_rq = task_rq(p);
> +	double_rq_lock(rq, p_rq);
> +	while (task_rq(p) != p_rq) {
> +		double_rq_unlock(rq, p_rq);
> +		goto again;
> +	}
> +
> +	if (!curr->sched_class->yield_to_task)
> +		goto out;
> +
> +	if (curr->sched_class != p->sched_class)
> +		goto out;
> +

to be clearer?

	if (task_running(p_rq, p) || p->state != TASK_RUNNING)
		goto out;

> +	if (!same_thread_group(p, curr))
> +		goto out;
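[Editor's illustration] The `again:` loop reviewed above is a general
lock-then-revalidate pattern: task_rq(p) can change between reading it
and taking the lock, so the code rechecks under the lock and retries.
The same idiom in a user-space sketch (all types and names below are
ours, not kernel code):

```c
#include <pthread.h>
#include <stddef.h>

/*
 * An object may migrate between containers, so after locking the
 * container we re-check that the object is still there, and retry
 * if it moved - just like yield_to() re-checks task_rq(p) after
 * double_rq_lock().
 */
struct bucket {
	pthread_mutex_t lock;
};

struct obj {
	struct bucket *home;	/* only changes while home->lock is held */
};

/* Return with the lock of o's current bucket held. */
static struct bucket *lock_obj_bucket(struct obj *o)
{
	for (;;) {
		struct bucket *b = o->home;	/* racy read */
		pthread_mutex_lock(&b->lock);
		if (o->home == b)
			return b;	/* still there: lock is valid */
		/* object migrated before we got the lock: retry */
		pthread_mutex_unlock(&b->lock);
	}
}
```

yield_to() does the same with double_rq_lock(): after taking both
runqueue locks it re-checks task_rq(p), since the target may have been
migrated by load balancing in the window between the read and the lock.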
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Wed, 2011-01-05 at 18:04 +0100, Peter Zijlstra wrote:

> On Wed, 2011-01-05 at 17:57 +0100, Mike Galbraith wrote:
>
> > +		p_cfs_rq = cfs_rq_of(pse);
> > +		local = 1;
> > +	}
> > +#endif
> > +
> > +	/* Tell the scheduler that we'd really like pse to run next. */
> > +	p_cfs_rq->next = pse;
> > +
> > +	/* We know whether we want to preempt or not, but are we allowed? */
> > +	preempt = same_thread_group(p, task_of(p_cfs_rq->curr));
> > +
> > +	if (local)
> > +		clear_buddies(cfs_rq, se);
>
> You might want to clear before setting next :-)

Or better, just remove dept. of redundancy dept. cruft. We clear
buddies upon selection. It's also pointless worrying whether to set
TIF_RESCHED or not, no cycle savings to be had there methinks. While
performing the cruftectomy, I also did a cosmetic int -> bool
conversion.

sched: Add yield_to(task, preempt) functionality.

Currently only implemented for fair class tasks.

Add a yield_to_task method() to the fair scheduling class, allowing the
caller of yield_to() to accelerate another thread in its thread group,
task group, and sched class toward either its cpu, or potentially the
caller's own cpu if the 'preempt' argument is also passed.

Implemented via a scheduler hint, using cfs_rq->next to encourage the
target being selected.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Mike Galbraith efa...@gmx.de

---
 include/linux/sched.h |  1
 kernel/sched.c        | 56 ++++++++++++++++++++++++++
 kernel/sched_fair.c   | 44 +++++++++++++++++++
 3 files changed, 101 insertions(+)

Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1056,6 +1056,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*yield_task) (struct rq *rq);
+	bool (*yield_to_task) (struct task_struct *p, bool preempt);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -5327,6 +5327,62 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ */
+void __sched yield_to(struct task_struct *p, bool preempt)
+{
+	struct task_struct *curr = current;
+	struct rq *rq, *p_rq;
+	unsigned long flags;
+	bool yield = 0;
+
+	local_irq_save(flags);
+	rq = this_rq();
+
+again:
+	p_rq = task_rq(p);
+	double_rq_lock(rq, p_rq);
+	while (task_rq(p) != p_rq) {
+		double_rq_unlock(rq, p_rq);
+		goto again;
+	}
+
+	if (!curr->sched_class->yield_to_task)
+		goto out;
+
+	if (curr->sched_class != p->sched_class)
+		goto out;
+
+	if (task_running(p_rq, p) || p->state)
+		goto out;
+
+	if (!same_thread_group(p, curr))
+		goto out;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (task_group(p) != task_group(curr))
+		goto out;
+#endif
+
+	yield = curr->sched_class->yield_to_task(p, preempt);
+
+out:
+	double_rq_unlock(rq, p_rq);
+	local_irq_restore(flags);
+
+	if (yield) {
+		set_current_state(TASK_RUNNING);
+		schedule();
+	}
+}
+EXPORT_SYMBOL_GPL(yield_to);
+
 /*
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
Index: linux-2.6/kernel/sched_fair.c
===
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1337,6 +1337,49 @@ static void yield_task_fair(struct rq *r
 }
 
 #ifdef CONFIG_SMP
+static void pull_task(struct rq *src_rq, struct task_struct *p,
+		      struct rq *this_rq, int this_cpu);
+#endif
+
+static bool yield_to_task_fair(struct task_struct *p, bool preempt)
+{
+	struct sched_entity *se = &current->se;
+	struct sched_entity *pse = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
+	int this_cpu = smp_processor_id();
+
+	if (!pse->on_rq)
+		return false;
+
+#ifdef CONFIG_SMP
+	/*
+	 * If this yield is important enough to want to preempt instead
+	 * of only dropping a ->next hint, we're alone, and the target
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On 01/05/2011 04:39 AM, KOSAKI Motohiro wrote:

> > On 01/04/2011 08:14 AM, KOSAKI Motohiro wrote:
> >
> > > Also, if pthread_cond_signal() called sys_yield_to implicitly, we
> > > could avoid almost all of the Nehalem (and other P2P cache arch)
> > > lock unfairness problem. (Probably creating
> > > pthread_condattr_setautoyield_np or a similar knob would be good.)
> >
> > Often, the thread calling pthread_cond_signal() wants to continue
> > executing, not yield.
>
> Then, it doesn't work. After calling pthread_cond_signal(), T1 (the
> cond_signal caller) and T2 (the thread that was woken) start to race
> for grabbing the GIL. But usually T1 always wins, because the lock
> variable is in T1's cpu cache.
>
> Why do the kernel and userland have such different results? One reason
> is that glibc doesn't have any ticket lock scheme. If you are
> interested in the GIL mess and issues, please feel free to ask more.

I suggest looking into an explicit round-robin scheme, where each
thread adds itself to a queue and an unlock wakes up the first waiter.

-- 
error compiling committee.c: too many arguments to function
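[Editor's illustration] The round-robin scheme Avi suggests can be
sketched in a few lines of pthreads code (a toy of ours, not a proposal
for glibc): each locker takes a ticket, and unlock advances the serving
number, so waiters acquire strictly in arrival order.

```c
#include <pthread.h>

/* Ticket-ordered "fair" lock: strictly FIFO handoff. */
typedef struct {
	pthread_mutex_t m;
	pthread_cond_t cv;
	unsigned long next_ticket;	/* next ticket to hand out */
	unsigned long now_serving;	/* ticket allowed to hold the lock */
} fair_lock_t;

#define FAIR_LOCK_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

static void fair_lock(fair_lock_t *l)
{
	pthread_mutex_lock(&l->m);
	unsigned long me = l->next_ticket++;	/* join the queue */
	while (l->now_serving != me)
		pthread_cond_wait(&l->cv, &l->m);
	pthread_mutex_unlock(&l->m);
}

static void fair_unlock(fair_lock_t *l)
{
	pthread_mutex_lock(&l->m);
	l->now_serving++;		/* hand off to the oldest waiter */
	pthread_cond_broadcast(&l->cv);
	pthread_mutex_unlock(&l->m);
}
```

As KOSAKI and Peter note downthread, strict FIFO handoff like this costs
a block/wakeup cycle per contended acquisition, which is exactly why
futex-based locks allow stealing instead.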
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On 01/05/2011 10:40 AM, KOSAKI Motohiro wrote:

> > > Then, it doesn't work. After calling pthread_cond_signal(), T1
> > > (the cond_signal caller) and T2 (the thread that was woken) start
> > > to race for grabbing the GIL. But usually T1 always wins, because
> > > the lock variable is in T1's cpu cache. Why do the kernel and
> > > userland have such different results? One reason is that glibc
> > > doesn't have any ticket lock scheme. If you are interested in the
> > > GIL mess and issues, please feel free to ask more.
> >
> > I suggest looking into an explicit round-robin scheme, where each
> > thread adds itself to a queue and an unlock wakes up the first
> > waiter.
>
> I'm sure you haven't tried your scheme. But I did. It's slow.

Won't anything with a heavily contended global/giant lock be slow?

What's the average lock hold time per thread? 10%? 50%? 90%?

-- 
error compiling committee.c: too many arguments to function
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On 01/05/2011 10:40 AM, KOSAKI Motohiro wrote:

> > > I suggest looking into an explicit round-robin scheme, where each
> > > thread adds itself to a queue and an unlock wakes up the first
> > > waiter.
> >
> > I'm sure you haven't tried your scheme. But I did. It's slow.
>
> Won't anything with a heavily contended global/giant lock be slow?
>
> What's the average lock hold time per thread? 10%? 50%? 90%?

Well, of course all heavy contention is slow. But we don't have to
compare heavily contended with lightly contended; we have to compare
heavily contended with heavily contended, or lightly contended with
lightly contended.

If we are talking about a scripting language VM, a pipe benchmark shows
impressive FIFO overhead with a scheme like the one you proposed,
because the pipe bench makes a frequent GIL grab/ungrab storm. Similarly,
the pipe bench used to show our (very) old kernel's bottleneck. Sadly,
userspace has no way to implement per-cpu runqueues, I think.

And if we are talking about a language VM, I can't give any average
time. It depends on the running script.
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On 01/05/2011 11:30 AM, KOSAKI Motohiro wrote:

> > Won't anything with a heavily contended global/giant lock be slow?
> >
> > What's the average lock hold time per thread? 10%? 50%? 90%?
>
> Well, of course all heavy contention is slow. But we don't have to
> compare heavily contended with lightly contended; we have to compare
> heavily contended with heavily contended, or lightly contended with
> lightly contended.
>
> If we are talking about a scripting language VM, a pipe benchmark
> shows impressive FIFO overhead with a scheme like the one you
> proposed, because the pipe bench makes a frequent GIL grab/ungrab
> storm. Similarly, the pipe bench used to show our (very) old kernel's
> bottleneck. Sadly, userspace has no way to implement per-cpu
> runqueues, I think.

A completely fair lock will likely be slower than an unfair lock.

> And if we are talking about a language VM, I can't give any average
> time. It depends on the running script.

Pick some parallel compute-intensive script, please.

-- 
error compiling committee.c: too many arguments to function
Re: [Fwd: Re: [RFC -v3 PATCH 2/3] sched: add yield_to function]
On Wed, 2011-01-05 at 00:38 +0100, Tommaso Cucinotta wrote:

> On 04/01/2011 19:15, Dario Faggioli wrote:
>
> >  Forwarded Message 
> > From: Peter Zijlstra a.p.zijls...@chello.nl
> > To: Rik van Riel r...@redhat.com
> > Cc: Hillf Danton dhi...@gmail.com, kvm@vger.kernel.org,
> >     linux-ker...@vger.kernel.org, Avi Kivity a...@redhat.com,
> >     Srivatsa Vaddagiri va...@linux.vnet.ibm.com,
> >     Mike Galbraith efa...@gmx.de, Chris Wright chr...@sous-sol.org
> > Subject: Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
> > Date: Tue, 04 Jan 2011 19:05:54 +0100
> >
> > RT guests don't make sense, there's nowhere near enough
> > infrastructure for that to work well. I'd argue that KVM running
> > with RT priority is a bug.
>
> Peter, can I ask why you state that? In the IRMOS project, we are
> deploying KVM VMs using Fabio's real-time scheduler (for others,
> a.k.a. Fabio's EDF throttling patch, or the IRMOS RT scheduler) in
> order to let the VMs get precise CPU scheduling guarantees from the
> kernel. So, in this context we do have KVM running at RT priority, and
> we have experimental results showing how this can improve the
> stability of performance of the hosted guest VMs. Of course, don't
> misunderstand me: this is a necessary condition for stable performance
> of KVM VMs, I'm not saying it is sufficient for

I was mostly referring to the existing RT cruft (SCHED_RR/FIFO), that's
utterly useless for KVM. As to hosting vcpus with CBS, this might maybe
make sense, but RT guests are still miles away.

Anyway, I'm not quite sure how you would want to deal with the guest
spinlock issue in CBS; ideally you'd use paravirt guests to avoid that
whole problem.

Anyway, /me goes do something useful, virt sucks and should be taken
out back and shot in the head.
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Wed, 2011-01-05 at 11:39 +0900, KOSAKI Motohiro wrote:

> After calling pthread_cond_signal(), T1 (the cond_signal caller) and
> T2 (the thread that was woken) start to race for grabbing the GIL. But
> usually T1 always wins, because the lock variable is in T1's cpu
> cache. Why do the kernel and userland have such different results? One
> reason is that glibc doesn't have any ticket lock scheme.

The problem with making locks strictly fair is that it sucks for
performance. IIRC most futex ops are FIFO-ordered once they hit the
block path, but we do allow lock-stealing. Lock-stealing greatly
improves performance since it avoids lots of block/wakeup cycles, but
it does make things unfair. I'm not sure we have a futex option to
disable lock-stealing, nor do I think you really want to; performance
suffers really badly.

[This, btw, is the reason why people reported a performance improvement
when they wrapped all mmap() calls in a pthread_mutex: the
rwsem_down_write() path doesn't allow for lock-stealing, since it needs
to be strictly FIFO-fair.]
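[Editor's illustration] The lock-stealing Peter describes can be
sketched with a C11 atomic flag (illustrative only, not glibc's actual
futex code): an arriving thread tries to grab the lock before queueing,
so it can "steal" it from an already-woken waiter that hasn't run yet,
saving a block/wakeup cycle at the price of FIFO order.

```c
#include <stdatomic.h>
#include <sched.h>

/*
 * Unfair test-and-set lock: any arriving thread may grab the flag
 * ahead of threads that have been waiting longer - the "steal".  A
 * futex lock does the same with its userspace fast path before
 * falling back to FUTEX_WAIT; this toy just spins.
 */
typedef struct {
	atomic_flag locked;
} tas_lock_t;

#define TAS_LOCK_INIT { ATOMIC_FLAG_INIT }

static int tas_trylock(tas_lock_t *l)
{
	/* the steal: succeeds regardless of how many others wait */
	return !atomic_flag_test_and_set_explicit(&l->locked,
						  memory_order_acquire);
}

static void tas_lock(tas_lock_t *l)
{
	while (!tas_trylock(l))
		sched_yield();	/* a real lock would futex-wait here */
}

static void tas_unlock(tas_lock_t *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Compare with a ticket scheme: here nothing orders the waiters, so a
thread arriving at exactly the moment of unlock wins even if others have
been spinning for ages - the unfairness that avoids the wakeup latency.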
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Wed, 2011-01-05 at 17:40 +0900, KOSAKI Motohiro wrote:

> > > If you are interested in the GIL mess and issues, please feel free
> > > to ask more.
> >
> > I suggest looking into an explicit round-robin scheme, where each
> > thread adds itself to a queue and an unlock wakes up the first
> > waiter.
>
> I'm sure you haven't tried your scheme. But I did. It's slow.

Of course it is, but then your locking scheme (GIL) is the problem, not
the locking constructs.
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
It's Rik's patch to do with whatever he wants (I donated a suggestion
;), but I took the liberty of cleaning it up a bit, see below. It's
essentially the same, but fixes a bug where the caller may have
targeted a task in its thread group etc., but when preemption decision
time came, may have preempted a forbidden task.

On Tue, 2011-01-04 at 19:04 +0100, Peter Zijlstra wrote:

> On Mon, 2011-01-03 at 16:29 -0500, Rik van Riel wrote:
>
> > +out:
> > +	double_rq_unlock(rq, p_rq);
> > +	local_irq_restore(flags);
> > +
> > +	if (yield) {
> > +		set_current_state(TASK_RUNNING);
> > +		schedule();
> > +	}
> > +}
> > +EXPORT_SYMBOL(yield_to);
>
> This definitely wants to be EXPORT_SYMBOL_GPL() and if it were
> possible I'd make it so only kvm.o could use it. It really sucks that
> kvm is a module.

Did that.

> > +#ifdef CONFIG_FAIR_GROUP_SCHED
> > +	if (cfs_rq->tg != p_cfs_rq->tg)
> > +		return 0;
> > +
> > +	/* Preemption only allowed within the same task group. */
> > +	if (preempt && cfs_rq->tg != cfs_rq_of(curr)->tg)
> > +		preempt = 0;
> > +#endif
>
> I'd simply bail if it's not the same cgroup, who cares about that case
> anyway, all KVM vcpu threads should be in the same cgroup I think.

See below.

> > +	/* Preemption only allowed within the same thread group. */
> > +	if (preempt && !same_thread_group(current, task_of(p_cfs_rq->curr)))
> > +		preempt = 0;
>
> The calling site already checks for same_thread_group(), we never even
> get here if that's not the case.

Yeah, the task group check was over there already as well. Hasty hack.

> > +#ifdef CONFIG_SMP
> > +	/*
> > +	 * If this yield is important enough to want to preempt instead
> > +	 * of only dropping a ->next hint, we're alone, and the target
> > +	 * is not alone, pull the target to this cpu.
> > +	 */
> > +	if (want_preempt && !yield && cfs_rq->nr_running == 1 &&
> > +	    cpumask_test_cpu(smp_processor_id(), &p->cpus_allowed)) {
> > +		pull_task(task_rq(p), p, this_rq(), smp_processor_id());
>
> This only works by the grace that the caller checked p->se.on_rq. A
> comment might be in order.

I moved all of the entity decisions to the class method, and tried to
make it a bit less ugly.

sched: Add yield_to(task, preempt) functionality.

Currently only implemented for fair class tasks.

Add a yield_to_task method() to the fair scheduling class, allowing the
caller of yield_to() to accelerate another thread in its thread group,
task group, and sched class toward either its cpu, or potentially the
caller's own cpu if the 'preempt' argument is also passed.

Implemented via a scheduler hint, using cfs_rq->next to encourage the
target being selected.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Mike Galbraith efa...@gmx.de

---
 include/linux/sched.h |  1
 kernel/sched.c        | 56 ++++++++++++++++++++++++++
 kernel/sched_fair.c   | 52 ++++++++++++++++++++++
 3 files changed, 109 insertions(+)

Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1056,6 +1056,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*yield_task) (struct rq *rq);
+	int (*yield_to_task) (struct task_struct *p, int preempt);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -5327,6 +5327,62 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ */
+void __sched yield_to(struct task_struct *p, int preempt)
+{
+	struct task_struct *curr = current;
+	struct rq *rq, *p_rq;
+	unsigned long flags;
+	int yield = 0;
+
+	local_irq_save(flags);
+	rq = this_rq();
+
+again:
+	p_rq = task_rq(p);
+	double_rq_lock(rq, p_rq);
+	while (task_rq(p) != p_rq) {
+		double_rq_unlock(rq, p_rq);
+		goto again;
+	}
+
+	if (!curr->sched_class->yield_to_task)
+		goto out;
+
+	if (curr->sched_class != p->sched_class)
+		goto out;
+
+	if (task_running(p_rq, p) || p->state)
+		goto out;
+
+	if (!same_thread_group(p, curr))
+		goto out;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (task_group(p) != task_group(curr))
+		goto out;
+#endif
+
+	yield
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Wed, 2011-01-05 at 17:57 +0100, Mike Galbraith wrote:

> +		p_cfs_rq = cfs_rq_of(pse);
> +		local = 1;
> +	}
> +#endif
> +
> +	/* Tell the scheduler that we'd really like pse to run next. */
> +	p_cfs_rq->next = pse;
> +
> +	/* We know whether we want to preempt or not, but are we allowed? */
> +	preempt = same_thread_group(p, task_of(p_cfs_rq->curr));
> +
> +	if (local)
> +		clear_buddies(cfs_rq, se);

You might want to clear before setting next :-)
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On 01/04/2011 08:04 PM, Peter Zijlstra wrote:

> This definitely wants to be EXPORT_SYMBOL_GPL() and if it were
> possible I'd make it so only kvm.o could use it. It really sucks that
> kvm is a module.

Why does it suck? I mean, apart from the "virtualization is crap" song.

-- 
error compiling committee.c: too many arguments to function
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On 01/05/2011 07:15 PM, Peter Zijlstra wrote:

> On Wed, 2011-01-05 at 19:10 +0200, Avi Kivity wrote:
>
> > On 01/04/2011 08:04 PM, Peter Zijlstra wrote:
> > > This definitely wants to be EXPORT_SYMBOL_GPL() and if it were
> > > possible I'd make it so only kvm.o could use it. It really sucks
> > > that kvm is a module.
> >
> > Why does it suck? I mean, apart from the "virtualization is crap"
> > song.
>
> Because it needs hooks all over the core kernel, like this yield_to()
> stuff. Exporting this might lead to others wanting to use it too.

Well, it's very convenient for development (modprobe vs. reboot). What
hooks do you object to? mmu notifiers are useful for some drivers, and
sched notifiers are useful for cmwq and possibly perf?

-- 
error compiling committee.c: too many arguments to function
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Wed, 2011-01-05 at 18:04 +0100, Peter Zijlstra wrote: On Wed, 2011-01-05 at 17:57 +0100, Mike Galbraith wrote: + p_cfs_rq = cfs_rq_of(pse); + local = 1; + } +#endif + + /* Tell the scheduler that we'd really like pse to run next. */ + p_cfs_rq->next = pse; + + /* We know whether we want to preempt or not, but are we allowed? */ + preempt = same_thread_group(p, task_of(p_cfs_rq->curr)); + + if (local) + clear_buddies(cfs_rq, se); You might want to clear before setting next :-) Heh, you made me thwap forehead.. but pse != se or p_cfs_rq->curr. Moved it back where it came from anyway ;-) sched: Add yield_to(task, preempt) functionality. Currently only implemented for fair class tasks. Add a yield_to_task() method to the fair scheduling class, allowing the caller of yield_to() to accelerate another thread in its thread group, task group, and sched class toward either its cpu, or potentially the caller's own cpu if the 'preempt' argument is also passed. Implemented via a scheduler hint, using cfs_rq->next to encourage the target being selected.
Signed-off-by: Rik van Riel r...@redhat.com Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Signed-off-by: Mike Galbraith efa...@gmx.de --- include/linux/sched.h |1 kernel/sched.c| 56 ++ kernel/sched_fair.c | 52 ++ 3 files changed, 109 insertions(+) Index: linux-2.6/include/linux/sched.h === --- linux-2.6.orig/include/linux/sched.h +++ linux-2.6/include/linux/sched.h @@ -1056,6 +1056,7 @@ struct sched_class { void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags); void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags); void (*yield_task) (struct rq *rq); + int (*yield_to_task) (struct task_struct *p, int preempt); void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags); Index: linux-2.6/kernel/sched.c === --- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -5327,6 +5327,62 @@ void __sched yield(void) } EXPORT_SYMBOL(yield); +/** + * yield_to - yield the current processor to another thread in + * your thread group, or accelerate that thread toward the + * processor it's on. + * + * It's the caller's job to ensure that the target task struct + * can't go away on us before we can do any checks. 
+ */ +void __sched yield_to(struct task_struct *p, int preempt) +{ + struct task_struct *curr = current; + struct rq *rq, *p_rq; + unsigned long flags; + int yield = 0; + + local_irq_save(flags); + rq = this_rq(); + +again: + p_rq = task_rq(p); + double_rq_lock(rq, p_rq); + while (task_rq(p) != p_rq) { + double_rq_unlock(rq, p_rq); + goto again; + } + + if (!curr->sched_class->yield_to_task) + goto out; + + if (curr->sched_class != p->sched_class) + goto out; + + if (task_running(p_rq, p) || p->state) + goto out; + + if (!same_thread_group(p, curr)) + goto out; + +#ifdef CONFIG_FAIR_GROUP_SCHED + if (task_group(p) != task_group(curr)) + goto out; +#endif + + yield = curr->sched_class->yield_to_task(p, preempt); + +out: + double_rq_unlock(rq, p_rq); + local_irq_restore(flags); + + if (yield) { + set_current_state(TASK_RUNNING); + schedule(); + } +} +EXPORT_SYMBOL_GPL(yield_to); + /* * This task is about to go to sleep on IO. Increment rq->nr_iowait so * that process accounting knows that this is a task in IO wait state. Index: linux-2.6/kernel/sched_fair.c === --- linux-2.6.orig/kernel/sched_fair.c +++ linux-2.6/kernel/sched_fair.c @@ -1337,6 +1337,57 @@ static void yield_task_fair(struct rq *r } #ifdef CONFIG_SMP +static void pull_task(struct rq *src_rq, struct task_struct *p, + struct rq *this_rq, int this_cpu); +#endif + +static int yield_to_task_fair(struct task_struct *p, int preempt) +{ + struct sched_entity *se = &current->se; + struct sched_entity *pse = &p->se; + struct cfs_rq *cfs_rq = cfs_rq_of(se); + struct cfs_rq *p_cfs_rq = cfs_rq_of(pse); + int local = cfs_rq == p_cfs_rq; + int this_cpu = smp_processor_id(); + + if (!pse->on_rq) + return 0; + +#ifdef CONFIG_SMP + /* +* If this yield is important enough to want to preempt instead +* of only dropping a ->next hint, we're alone, and the target +* is not alone, pull the target to this cpu. +* +* NOTE: the target may be alone in its cfs_rq if another class +* task or another task group is currently executing on its cpu. +* In this case, we still pull, to accelerate it toward the cpu. +*/
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Wed, 2011-01-05 at 19:19 +0200, Avi Kivity wrote: On 01/05/2011 07:15 PM, Peter Zijlstra wrote: On Wed, 2011-01-05 at 19:10 +0200, Avi Kivity wrote: On 01/04/2011 08:04 PM, Peter Zijlstra wrote: This definitely wants to be EXPORT_SYMBOL_GPL() and if it were possible I'd make it so only kvm.o could use it. It really sucks that kvm is a module. Why does it suck? I mean apart from the virtualization is crap song. Because it needs hooks all over the core kernel, like this yield_to() stuff. Exporting this might lead to others wanting to use it too. Well, it's very convenient for development (modprobe vs. reboot). Get a machine that boots fast ;-), it might be because I've never worked on anything modular, but typically stuff crashes when I got it wrong, in which case you're forced to power-cycle anyway. What hooks do you object to? Don't really keep a list, but every now and again see something that makes me think that ought not be exported. mmu notifiers are useful for some drivers, sched notifiers are useful for cmwq and possibly perf? Both use explicit function calls because it's easier to read the code and faster.
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On 01/05/2011 07:28 PM, Peter Zijlstra wrote: On Wed, 2011-01-05 at 19:19 +0200, Avi Kivity wrote: On 01/05/2011 07:15 PM, Peter Zijlstra wrote: On Wed, 2011-01-05 at 19:10 +0200, Avi Kivity wrote: On 01/04/2011 08:04 PM, Peter Zijlstra wrote: This definitely wants to be EXPORT_SYMBOL_GPL() and if it were possible I'd make it so only kvm.o could use it. It really sucks that kvm is a module. Why does it suck? I mean apart from the virtualization is crap song. Because it needs hooks all over the core kernel, like this yield_to() stuff. Exporting this might lead to others wanting to use it too. Well, it's very convenient for development (modprobe vs. reboot). Get a machine that boots fast ;-), it might be because I've never worked on anything modular, but typically stuff crashes when I got it wrong, in which case you're forced to power-cycle anyway. I often get by with just guest crashes. Or maybe it oopses, I look around, then reboot at my leisure. What hooks do you object to? Don't really keep a list, but every now and again see something that makes me think that ought not be exported. mmu notifiers are useful for some drivers, sched notifiers are useful for cmwq and possibly perf? Both use explicit function calls because it's easier to read the code and faster. I doubt it's measurable, especially for perf with the msr tweaking it does. Tejun, why did you end up not using preempt_notifiers in cmwq? -- error compiling committee.c: too many arguments to function
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Wed, 2011-01-05 at 19:35 +0200, Avi Kivity wrote: Tejun, why did you end up not using preempt_notifiers in cmwq? Because I told him to use explicit function calls because that keeps the code easier to read.
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Wed, 2011-01-05 at 19:10 +0200, Avi Kivity wrote: On 01/04/2011 08:04 PM, Peter Zijlstra wrote: This definitely wants to be EXPORT_SYMBOL_GPL() and if it were possible I'd make it so only kvm.o could use it. It really sucks that kvm is a module. Why does it suck? I mean apart from the virtualization is crap song. Because it needs hooks all over the core kernel, like this yield_to() stuff. Exporting this might lead to others wanting to use it too.
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On 01/04/2011 08:14 AM, KOSAKI Motohiro wrote: Also, if pthread_cond_signal() called sys_yield_to implicitly, we could avoid almost all Nehalem (and other P2P cache arch) lock unfairness problems. (probably creating pthread_condattr_setautoyield_np or a similar knob is a good one) Often, the thread calling pthread_cond_signal() wants to continue executing, not yield. -- error compiling committee.c: too many arguments to function
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Tue, Jan 4, 2011 at 5:29 AM, Rik van Riel r...@redhat.com wrote: From: Mike Galbraith efa...@gmx.de Add a yield_to function to the scheduler code, allowing us to give enough of our timeslice to another thread to allow it to run and release whatever resource we need it to release. We may want to use this to provide a sys_yield_to system call one day. Signed-off-by: Rik van Riel r...@redhat.com Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Not-signed-off-by: Mike Galbraith efa...@gmx.de --- Mike, want to change the above into a Signed-off-by: ? :) This code seems to work well. diff --git a/include/linux/sched.h b/include/linux/sched.h index c5f926c..0b8a3e6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1083,6 +1083,7 @@ struct sched_class { void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup); void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep); void (*yield_task) (struct rq *rq); + int (*yield_to_task) (struct task_struct *p, int preempt); void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags); @@ -1981,6 +1982,7 @@ static inline int rt_mutex_getprio(struct task_struct *p) # define rt_mutex_adjust_pi(p) do { } while (0) #endif +extern void yield_to(struct task_struct *p, int preempt); extern void set_user_nice(struct task_struct *p, long nice); extern int task_prio(const struct task_struct *p); extern int task_nice(const struct task_struct *p); diff --git a/kernel/sched.c b/kernel/sched.c index f8e5a25..ffa7a9d 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -6901,6 +6901,53 @@ void __sched yield(void) } EXPORT_SYMBOL(yield); +/** + * yield_to - yield the current processor to another thread in + * your thread group, or accelerate that thread toward the + * processor it's on. + * + * It's the caller's job to ensure that the target task struct + * can't go away on us before we can do any checks. 
+ */ +void __sched yield_to(struct task_struct *p, int preempt) +{ + struct task_struct *curr = current; + struct rq *rq, *p_rq; + unsigned long flags; + int yield = 0; + + local_irq_save(flags); + rq = this_rq(); + +again: + p_rq = task_rq(p); + double_rq_lock(rq, p_rq); + while (task_rq(p) != p_rq) { + double_rq_unlock(rq, p_rq); + goto again; + } + + if (task_running(p_rq, p) || p->state || !p->se.on_rq || + !same_thread_group(p, curr) || + !curr->sched_class->yield_to_task || + curr->sched_class != p->sched_class) { + goto out; + } + + yield = curr->sched_class->yield_to_task(p, preempt); + +out: + double_rq_unlock(rq, p_rq); + local_irq_restore(flags); + + if (yield) { + set_current_state(TASK_RUNNING); + schedule(); + } +} +EXPORT_SYMBOL(yield_to); + + /* * This task is about to go to sleep on IO. Increment rq->nr_iowait so * that process accounting knows that this is a task in IO wait state. diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index 5119b08..3288e7c 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -1119,6 +1119,61 @@ static void yield_task_fair(struct rq *rq) } #ifdef CONFIG_SMP +static void pull_task(struct rq *src_rq, struct task_struct *p, + struct rq *this_rq, int this_cpu); +#endif + +static int yield_to_task_fair(struct task_struct *p, int preempt) +{ + struct sched_entity *se = &current->se; + struct sched_entity *pse = &p->se; + struct sched_entity *curr = &(task_rq(p)->curr)->se; + struct cfs_rq *cfs_rq = cfs_rq_of(se); + struct cfs_rq *p_cfs_rq = cfs_rq_of(pse); + int yield = this_rq() == task_rq(p); + int want_preempt = preempt; + +#ifdef CONFIG_FAIR_GROUP_SCHED + if (cfs_rq->tg != p_cfs_rq->tg) + return 0; + + /* Preemption only allowed within the same task group.
*/ + if (preempt && !same_thread_group(current, task_of(p_cfs_rq->curr))) + preempt = 0; + +#ifdef CONFIG_SMP + /* + * If this yield is important enough to want to preempt instead + * of only dropping a ->next hint, we're alone, and the target + * is not alone, pull the target to this cpu. + */ + if (want_preempt && !yield && cfs_rq->nr_running == 1 && + cpumask_test_cpu(smp_processor_id(), &p->cpus_allowed)) { + pull_task(task_rq(p), p, this_rq(), smp_processor_id()); + p_cfs_rq = cfs_rq_of(pse); + yield = 1; + } +#endif + + if
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Tue, Jan 4, 2011 at 5:29 AM, Rik van Riel r...@redhat.com wrote: From: Mike Galbraith efa...@gmx.de Add a yield_to function to the scheduler code, allowing us to give enough of our timeslice to another thread to allow it to run and release whatever resource we need it to release. We may want to use this to provide a sys_yield_to system call one day. Signed-off-by: Rik van Riel r...@redhat.com Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Not-signed-off-by: Mike Galbraith efa...@gmx.de --- Mike, want to change the above into a Signed-off-by: ? :) This code seems to work well. diff --git a/include/linux/sched.h b/include/linux/sched.h index c5f926c..0b8a3e6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1083,6 +1083,7 @@ struct sched_class { void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup); void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep); void (*yield_task) (struct rq *rq); + int (*yield_to_task) (struct task_struct *p, int preempt); void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags); @@ -1981,6 +1982,7 @@ static inline int rt_mutex_getprio(struct task_struct *p) # define rt_mutex_adjust_pi(p) do { } while (0) #endif +extern void yield_to(struct task_struct *p, int preempt); extern void set_user_nice(struct task_struct *p, long nice); extern int task_prio(const struct task_struct *p); extern int task_nice(const struct task_struct *p); diff --git a/kernel/sched.c b/kernel/sched.c index f8e5a25..ffa7a9d 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -6901,6 +6901,53 @@ void __sched yield(void) } EXPORT_SYMBOL(yield); +/** + * yield_to - yield the current processor to another thread in + * your thread group, or accelerate that thread toward the + * processor it's on. + * + * It's the caller's job to ensure that the target task struct + * can't go away on us before we can do any checks. 
+ */ +void __sched yield_to(struct task_struct *p, int preempt) +{ + struct task_struct *curr = current; + struct task_struct *next; + struct rq *rq, *p_rq; + unsigned long flags; + int yield = 0; + + local_irq_save(flags); + rq = this_rq(); + +again: + p_rq = task_rq(p); + double_rq_lock(rq, p_rq); + while (task_rq(p) != p_rq) { + double_rq_unlock(rq, p_rq); + goto again; + } + + if (task_running(p_rq, p) || p->state || !p->se.on_rq || + !same_thread_group(p, curr) || /* !curr->sched_class->yield_to_task || */ + curr->sched_class != p->sched_class) { + goto out; + } + /* * ask scheduler to compute the next for successfully kicking @p onto its CPU * what if p_rq is rt_class to do? */ next = pick_next_task(p_rq); if (next != p) p->se.vruntime = next->se.vruntime - 1; deactivate_task(p_rq, p, 0); activate_task(p_rq, p, 0); if (rq == p_rq) schedule(); else resched_task(p_rq->curr); yield = 0; /* yield = curr->sched_class->yield_to_task(p, preempt); */ + +out: + double_rq_unlock(rq, p_rq); + local_irq_restore(flags); + + if (yield) { + set_current_state(TASK_RUNNING); + schedule(); + } +} +EXPORT_SYMBOL(yield_to); + + /* * This task is about to go to sleep on IO. Increment rq->nr_iowait so * that process accounting knows that this is a task in IO wait state.
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index 5119b08..3288e7c 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -1119,6 +1119,61 @@ static void yield_task_fair(struct rq *rq) } #ifdef CONFIG_SMP +static void pull_task(struct rq *src_rq, struct task_struct *p, + struct rq *this_rq, int this_cpu); +#endif + +static int yield_to_task_fair(struct task_struct *p, int preempt) +{ + struct sched_entity *se = &current->se; + struct sched_entity *pse = &p->se; + struct sched_entity *curr = &(task_rq(p)->curr)->se; + struct cfs_rq *cfs_rq = cfs_rq_of(se); + struct cfs_rq *p_cfs_rq = cfs_rq_of(pse); + int yield = this_rq() == task_rq(p); + int want_preempt = preempt; + +#ifdef CONFIG_FAIR_GROUP_SCHED + if (cfs_rq->tg != p_cfs_rq->tg) + return 0; + + /* Preemption only allowed within the same task group. */ + if (preempt && cfs_rq->tg != cfs_rq_of(curr)->tg) + preempt = 0; +#endif + /* Preemption only allowed within the same thread group. */ + if (preempt && !same_thread_group(current, task_of(p_cfs_rq->curr))) + preempt = 0; + +#ifdef CONFIG_SMP + /* + * If this
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On 01/04/2011 11:41 AM, Hillf Danton wrote: /* !curr->sched_class->yield_to_task || */ + curr->sched_class != p->sched_class) { + goto out; + } + /* * ask scheduler to compute the next for successfully kicking @p onto its CPU * what if p_rq is rt_class to do? */ next = pick_next_task(p_rq); if (next != p) p->se.vruntime = next->se.vruntime - 1; deactivate_task(p_rq, p, 0); activate_task(p_rq, p, 0); if (rq == p_rq) schedule(); else resched_task(p_rq->curr); yield = 0; Wouldn't that break for FIFO and RR tasks? There's a reason all the scheduler folks wanted a per-class yield_to_task function :) -- All rights reversed
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Wed, Jan 5, 2011 at 12:44 AM, Rik van Riel r...@redhat.com wrote: On 01/04/2011 11:41 AM, Hillf Danton wrote: /* !curr->sched_class->yield_to_task || */ + curr->sched_class != p->sched_class) { + goto out; + } + /* * ask scheduler to compute the next for successfully kicking @p onto its CPU * what if p_rq is rt_class to do? */ next = pick_next_task(p_rq); if (next != p) p->se.vruntime = next->se.vruntime - 1; deactivate_task(p_rq, p, 0); activate_task(p_rq, p, 0); if (rq == p_rq) schedule(); else resched_task(p_rq->curr); yield = 0; Wouldn't that break for FIFO and RR tasks? There's a reason all the scheduler folks wanted a per-class yield_to_task function :) Where is the yield_to callback in the patch for RT schedule class? If @p is RT, what could you do? Hillf
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On 01/04/2011 11:51 AM, Hillf Danton wrote: Wouldn't that break for FIFO and RR tasks? There's a reason all the scheduler folks wanted a per-class yield_to_task function :) Where is the yield_to callback in the patch for RT schedule class? If @p is RT, what could you do? If the user chooses to overcommit the CPU with realtime tasks, the user cannot expect realtime response. For realtime, I have not implemented the yield_to callback at all because it would probably break realtime semantics and I assume people will not overcommit the CPU with realtime tasks anyway. I could see running a few realtime guests on a system, with the number of realtime VCPUs not exceeding the number of physical CPUs. -- All rights reversed
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Wed, Jan 5, 2011 at 12:54 AM, Rik van Riel r...@redhat.com wrote: On 01/04/2011 11:51 AM, Hillf Danton wrote: Wouldn't that break for FIFO and RR tasks? There's a reason all the scheduler folks wanted a per-class yield_to_task function :) Where is the yield_to callback in the patch for RT schedule class? If @p is RT, what could you do? If the user chooses to overcommit the CPU with realtime tasks, the user cannot expect realtime response. For realtime, I have not implemented the yield_to callback at all because it would probably break realtime semantics and I assume people will not overcommit the CPU with realtime tasks anyway. I could see running a few realtime guests on a system, with the number of realtime VCPUs not exceeding the number of physical CPUs. Then it looks like curr->sched_class != p->sched_class is not enough, and yield_to cannot ease the lock contention in KVM in the case where p's rq->curr is RT.
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Wed, 2011-01-05 at 00:51 +0800, Hillf Danton wrote: Where is the yield_to callback in the patch for RT schedule class? If @p is RT, what could you do? RT guests are a pipe dream, you first need to get the hypervisor (kvm in this case) to be RT, which it isn't. Then you either need to very statically set-up the host and the guest scheduling constraints (not possible with RR/FIFO) or have a complete paravirt RT scheduler which communicates its requirements to the host.
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Wed, Jan 5, 2011 at 1:08 AM, Peter Zijlstra a.p.zijls...@chello.nl wrote: On Wed, 2011-01-05 at 00:51 +0800, Hillf Danton wrote: Where is the yield_to callback in the patch for RT schedule class? If @p is RT, what could you do? RT guests are a pipe dream, you first need to get the hypervisor (kvm in this case) to be RT, which it isn't. Then you either need to very statically set-up the host and the guest scheduling constraints (not possible with RR/FIFO) or have a complete paravirt RT scheduler which communicates its requirements to the host. Even if the guest is not RT, you cannot prevent it from being preempted by an RT task which has nothing to do with guests.
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Tue, 2011-01-04 at 15:14 +0900, KOSAKI Motohiro wrote: From: Mike Galbraith efa...@gmx.de Add a yield_to function to the scheduler code, allowing us to give enough of our timeslice to another thread to allow it to run and release whatever resource we need it to release. We may want to use this to provide a sys_yield_to system call one day. At least I want it. Ruby has a GIL (giant interpreter lock), and a giant lock naturally forces an app to implement a cooperative multithreading model, so it has a problem similar to yours. Python solved this issue by slowing down the lock mechanism (two pthread-cond wakeups per GIL release), but I don't want that. Also, if pthread_cond_signal() called sys_yield_to implicitly, we could avoid almost all Nehalem (and other P2P cache arch) lock unfairness problems. (probably creating pthread_condattr_setautoyield_np or a similar knob is a good one) NAK NAK NAK, yield_to is utter crap, and the only reason kvm 'needs' it is because it wants to be utter crap (run unmodified guests). There are plenty of sane serialization primitives for userspace, fix your locking mess instead of pushing crap. The only reason I'm maybe half-way considering this is because it's a pure in-kernel interface which we can 'fix' once unmodified guests aren't important anymore.
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Wed, 2011-01-05 at 01:12 +0800, Hillf Danton wrote: On Wed, Jan 5, 2011 at 1:08 AM, Peter Zijlstra a.p.zijls...@chello.nl wrote: On Wed, 2011-01-05 at 00:51 +0800, Hillf Danton wrote: Where is the yield_to callback in the patch for RT schedule class? If @p is RT, what could you do? RT guests are a pipe dream, you first need to get the hypervisor (kvm in this case) to be RT, which it isn't. Then you either need to very statically set-up the host and the guest scheduling constraints (not possible with RR/FIFO) or have a complete paravirt RT scheduler which communicates its requirements to the host. Even if the guest is not RT, you cannot prevent it from being preempted by an RT task which has nothing to do with guests. Sure, but yield_to() being a NOP for RT tasks is perfectly fine. Pretty much all yield*() usage is a bug anyway.
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On 01/04/2011 12:08 PM, Peter Zijlstra wrote: On Wed, 2011-01-05 at 00:51 +0800, Hillf Danton wrote: Where is the yield_to callback in the patch for RT schedule class? If @p is RT, what could you do? RT guests are a pipe dream, you first need to get the hypervisor (kvm in this case) to be RT, which it isn't. Then you either need to very statically set-up the host and the guest scheduling constraints (not possible with RR/FIFO) or have a complete paravirt RT scheduler which communicates its requirements to the host. There's a limited use case. One host can have a few RT guests, say a host with 8 CPUs can have up to 6 or 7 RT VCPUs. Those guests get top priority. The host can then have some other, low priority, guests that scavenge remaining CPU time. In this case, no yield_to is required for RT guests, because they do not do overcommit. -- All rights reversed
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Mon, 2011-01-03 at 16:29 -0500, Rik van Riel wrote: From: Mike Galbraith efa...@gmx.de Add a yield_to function to the scheduler code, allowing us to give enough of our timeslice to another thread to allow it to run and release whatever resource we need it to release. We may want to use this to provide a sys_yield_to system call one day. Signed-off-by: Rik van Riel r...@redhat.com Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Not-signed-off-by: Mike Galbraith efa...@gmx.de --- Mike, want to change the above into a Signed-off-by: ? :) This code seems to work well. diff --git a/include/linux/sched.h b/include/linux/sched.h index c5f926c..0b8a3e6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1083,6 +1083,7 @@ struct sched_class { void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup); void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep); void (*yield_task) (struct rq *rq); + int (*yield_to_task) (struct task_struct *p, int preempt); void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags); @@ -1981,6 +1982,7 @@ static inline int rt_mutex_getprio(struct task_struct *p) # define rt_mutex_adjust_pi(p) do { } while (0) #endif +extern void yield_to(struct task_struct *p, int preempt); extern void set_user_nice(struct task_struct *p, long nice); extern int task_prio(const struct task_struct *p); extern int task_nice(const struct task_struct *p); diff --git a/kernel/sched.c b/kernel/sched.c index f8e5a25..ffa7a9d 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -6901,6 +6901,53 @@ void __sched yield(void) } EXPORT_SYMBOL(yield); +/** + * yield_to - yield the current processor to another thread in + * your thread group, or accelerate that thread toward the + * processor it's on. + * + * It's the caller's job to ensure that the target task struct + * can't go away on us before we can do any checks. 
+ */ +void __sched yield_to(struct task_struct *p, int preempt) +{ + struct task_struct *curr = current; + struct rq *rq, *p_rq; + unsigned long flags; + int yield = 0; + + local_irq_save(flags); + rq = this_rq(); + +again: + p_rq = task_rq(p); + double_rq_lock(rq, p_rq); + while (task_rq(p) != p_rq) { + double_rq_unlock(rq, p_rq); + goto again; + } + + if (task_running(p_rq, p) || p->state || !p->se.on_rq || + !same_thread_group(p, curr) || + !curr->sched_class->yield_to_task || + curr->sched_class != p->sched_class) { + goto out; + } + + yield = curr->sched_class->yield_to_task(p, preempt); + +out: + double_rq_unlock(rq, p_rq); + local_irq_restore(flags); + + if (yield) { + set_current_state(TASK_RUNNING); + schedule(); + } +} +EXPORT_SYMBOL(yield_to); This definitely wants to be EXPORT_SYMBOL_GPL() and if it were possible I'd make it so only kvm.o could use it. It really sucks that kvm is a module. /* * This task is about to go to sleep on IO. Increment rq->nr_iowait so * that process accounting knows that this is a task in IO wait state. diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index 5119b08..3288e7c 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -1119,6 +1119,61 @@ static void yield_task_fair(struct rq *rq) } #ifdef CONFIG_SMP +static void pull_task(struct rq *src_rq, struct task_struct *p, + struct rq *this_rq, int this_cpu); +#endif + +static int yield_to_task_fair(struct task_struct *p, int preempt) +{ + struct sched_entity *se = &current->se; + struct sched_entity *pse = &p->se; + struct sched_entity *curr = &(task_rq(p)->curr)->se; + struct cfs_rq *cfs_rq = cfs_rq_of(se); + struct cfs_rq *p_cfs_rq = cfs_rq_of(pse); + int yield = this_rq() == task_rq(p); + int want_preempt = preempt; + +#ifdef CONFIG_FAIR_GROUP_SCHED + if (cfs_rq->tg != p_cfs_rq->tg) + return 0; + + /* Preemption only allowed within the same task group.
*/ + if (preempt && cfs_rq->tg != cfs_rq_of(curr)->tg) + preempt = 0; +#endif I'd simply bail if it's not the same cgroup, who cares about that case anyway, all KVM vcpu threads should be in the same cgroup I think. + /* Preemption only allowed within the same thread group. */ + if (preempt && !same_thread_group(current, task_of(p_cfs_rq->curr))) + preempt = 0; The calling site already checks for same_thread_group(), we never even get here if that's not the case. +#ifdef CONFIG_SMP + /* + * If this yield is important enough to want to preempt instead + * of only dropping a ->next hint, we're alone, and the target + * is not alone, pull the target to this cpu. + */ + if (want_preempt
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Tue, 2011-01-04 at 19:04 +0100, Peter Zijlstra wrote: + p_cfs_rq = cfs_rq_of(pse); + yield = 1; + } +#endif + + if (yield) + clear_buddies(cfs_rq, se); + else if (preempt) + clear_buddies(p_cfs_rq, curr); + + /* Tell the scheduler that we'd really like pse to run next. */ + p_cfs_rq->next = pse; + + if (!yield && preempt) + resched_task(task_of(p_cfs_rq->curr)); I don't get this.. Why would you resched the remote cpu, surely you didn't just pull its current task over.. :) hope not. Caller bails on task_running(p_rq, p). wants a comment too I suppose. Target doesn't live here, preempt is still set/allowed, so we want the remote cpu to schedule. -Mike
Re: [Fwd: Re: [RFC -v3 PATCH 2/3] sched: add yield_to function]
On 04/01/2011 19:15, Dario Faggioli wrote:

-------- Forwarded Message --------
From: Peter Zijlstra <a.p.zijls...@chello.nl>
To: Rik van Riel <r...@redhat.com>
Cc: Hillf Danton <dhi...@gmail.com>, kvm@vger.kernel.org, linux-ker...@vger.kernel.org, Avi Kivity <a...@redhat.com>, Srivatsa Vaddagiri <va...@linux.vnet.ibm.com>, Mike Galbraith <efa...@gmx.de>, Chris Wright <chr...@sous-sol.org>
Subject: Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
Date: Tue, 04 Jan 2011 19:05:54 +0100

RT guests don't make sense, there's nowhere near enough infrastructure for that to work well. I'd argue that KVM running with RT priority is a bug.

Peter, can I ask why you stated that?

In the IRMOS project, we are deploying KVM VMs under Fabio's real-time scheduler (for others, a.k.a. Fabio's EDF throttling patch, or the IRMOS RT scheduler) in order to give the VMs precise CPU scheduling guarantees from the kernel. So, in this context we do have KVM running at RT priority, and we have experimental results showing how this can improve the stability of performance of the hosted guest VMs.

Of course, don't misunderstand me: this is a necessary condition for stable performance of KVM VMs, I'm not saying it is sufficient for

--
Tommaso Cucinotta, Computer Engineering PhD, Researcher
ReTiS Lab, Scuola Superiore Sant'Anna, Pisa, Italy
Tel +39 050 882 024, Fax +39 050 882 003
http://retis.sssup.it/people/tommaso
Re: [Fwd: Re: [RFC -v3 PATCH 2/3] sched: add yield_to function]
On 05/01/2011 00:38, Tommaso Cucinotta wrote:

Of course, don't misunderstand me: this is a necessary condition for stable performance of KVM VMs, I'm not saying it is sufficient for

... it.

Please, comment on this (reply to all, please, I'm not following LKML).

Thanks,
Tommaso
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On 01/04/2011 08:14 AM, KOSAKI Motohiro wrote:

Also, if pthread_cond_signal() called sys_yield_to implicitly, we could avoid most of the Nehalem (and other point-to-point cache architecture) lock unfairness problems. (Probably creating pthread_condattr_setautoyield_np or a similar knob would be a good idea.)

Often, the thread calling pthread_cond_signal() wants to continue executing, not yield.

Then it doesn't work. After calling pthread_cond_signal(), T1 (the cond_signal caller) and T2 (the thread that was woken) start racing for the GIL. But T1 almost always wins, because the lock variable is in T1's CPU cache.

Why do kernel and userland get such different results? One reason is that glibc doesn't have any ticket lock scheme. If you are interested in the GIL mess, please feel free to ask for more.
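KOSAKI's ticket-lock remark refers to the FIFO fairness kernel spinlocks get from ticketing, which a plain pthread mutex lacks: the cache-hot releaser can immediately re-take a mutex, but it cannot steal a ticket. A minimal userspace sketch using the GCC __sync builtins (illustrative only, not glibc or kernel code; real code would want cpu_relax()-style backoff and carefully chosen memory ordering) might look like this:

```c
#include <pthread.h>

/* Minimal ticket lock: FIFO fairness, each contender waits its turn. */
struct ticket_lock {
	volatile unsigned int head;	/* next ticket handed out */
	volatile unsigned int tail;	/* ticket currently being served */
};

static void ticket_lock(struct ticket_lock *l)
{
	unsigned int my = __sync_fetch_and_add(&l->head, 1);

	while (l->tail != my)
		;	/* spin until our ticket comes up */
}

static void ticket_unlock(struct ticket_lock *l)
{
	__sync_fetch_and_add(&l->tail, 1);
}

static struct ticket_lock lock;
static long counter;

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < 100000; i++) {
		ticket_lock(&lock);
		counter++;		/* critical section */
		ticket_unlock(&lock);
	}
	return NULL;
}

/* Run nthreads workers (max 8); the result is deterministic if the lock is correct. */
static long run_demo(int nthreads)
{
	pthread_t t[8];

	counter = 0;
	lock.head = lock.tail = 0;
	for (int i = 0; i < nthreads; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < nthreads; i++)
		pthread_join(t[i], NULL);
	return counter;
}
```

Unlike a mutex, a releaser that immediately calls ticket_lock() again goes to the back of the queue, which is exactly the fairness property being discussed.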
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
NAK NAK NAK, yield_to is utter crap, and the only reason kvm 'needs' it is because it wants to be utter crap (run unmodified guests). There are plenty of sane serialization primitives for userspace; fix your locking mess instead of pushing crap. The only reason I'm maybe half-way considering this is because it's a pure in-kernel interface which we can 'fix' once unmodified guests aren't important anymore.

KVM also has slow and inefficient mechanisms. Userspace has a lot of slow primitives too. But okay, I withdraw this claim even though I disagree that userspace has plenty of sane serialization primitives. It's off topic and we need to discuss it more in a proper place. Thanks.
[RFC -v3 PATCH 2/3] sched: add yield_to function
From: Mike Galbraith <efa...@gmx.de>

Add a yield_to function to the scheduler code, allowing us to give enough of our timeslice to another thread to allow it to run and release whatever resource we need it to release. We may want to use this to provide a sys_yield_to system call one day.

Signed-off-by: Rik van Riel <r...@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>
Not-signed-off-by: Mike Galbraith <efa...@gmx.de>
---
Mike, want to change the above into a Signed-off-by: ? :) This code seems to work well.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c5f926c..0b8a3e6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1083,6 +1083,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
 	void (*yield_task) (struct rq *rq);
+	int (*yield_to_task) (struct task_struct *p, int preempt);

 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);

@@ -1981,6 +1982,7 @@ static inline int rt_mutex_getprio(struct task_struct *p)
 # define rt_mutex_adjust_pi(p) do { } while (0)
 #endif

+extern void yield_to(struct task_struct *p, int preempt);
 extern void set_user_nice(struct task_struct *p, long nice);
 extern int task_prio(const struct task_struct *p);
 extern int task_nice(const struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index f8e5a25..ffa7a9d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6901,6 +6901,53 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);

+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ */
+void __sched yield_to(struct task_struct *p, int preempt)
+{
+	struct task_struct *curr = current;
+	struct rq *rq, *p_rq;
+	unsigned long flags;
+	int yield = 0;
+
+	local_irq_save(flags);
+	rq = this_rq();
+
+again:
+	p_rq = task_rq(p);
+	double_rq_lock(rq, p_rq);
+	while (task_rq(p) != p_rq) {
+		double_rq_unlock(rq, p_rq);
+		goto again;
+	}
+
+	if (task_running(p_rq, p) || p->state || !p->se.on_rq ||
+	    !same_thread_group(p, curr) ||
+	    !curr->sched_class->yield_to_task ||
+	    curr->sched_class != p->sched_class) {
+		goto out;
+	}
+
+	yield = curr->sched_class->yield_to_task(p, preempt);
+
+out:
+	double_rq_unlock(rq, p_rq);
+	local_irq_restore(flags);
+
+	if (yield) {
+		set_current_state(TASK_RUNNING);
+		schedule();
+	}
+}
+EXPORT_SYMBOL(yield_to);
+
+
 /*
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5119b08..3288e7c 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1119,6 +1119,61 @@ static void yield_task_fair(struct rq *rq)
 }

 #ifdef CONFIG_SMP
+static void pull_task(struct rq *src_rq, struct task_struct *p,
+		      struct rq *this_rq, int this_cpu);
+#endif
+
+static int yield_to_task_fair(struct task_struct *p, int preempt)
+{
+	struct sched_entity *se = &current->se;
+	struct sched_entity *pse = &p->se;
+	struct sched_entity *curr = &(task_rq(p)->curr)->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
+	int yield = this_rq() == task_rq(p);
+	int want_preempt = preempt;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (cfs_rq->tg != p_cfs_rq->tg)
+		return 0;
+
+	/* Preemption only allowed within the same task group. */
+	if (preempt && cfs_rq->tg != cfs_rq_of(curr)->tg)
+		preempt = 0;
+#endif
+	/* Preemption only allowed within the same thread group.
	 */
+	if (preempt && !same_thread_group(current, task_of(p_cfs_rq->curr)))
+		preempt = 0;
+
+#ifdef CONFIG_SMP
+	/*
+	 * If this yield is important enough to want to preempt instead
+	 * of only dropping a ->next hint, we're alone, and the target
+	 * is not alone, pull the target to this cpu.
+	 */
+	if (want_preempt && !yield && cfs_rq->nr_running == 1 &&
+	    cpumask_test_cpu(smp_processor_id(), &p->cpus_allowed)) {
+		pull_task(task_rq(p), p, this_rq(), smp_processor_id());
+		p_cfs_rq = cfs_rq_of(pse);
+		yield = 1;
+	}
+#endif
+
+	if (yield)
+		clear_buddies(cfs_rq, se);
+	else if (preempt)
+		clear_buddies(p_cfs_rq, curr);
+
+	/* Tell the scheduler that we'd really like pse to run next. */
+	p_cfs_rq->next = pse;
+
+	if (!yield && preempt)
+		resched_task(task_of(p_cfs_rq->curr));
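The again: loop in yield_to above exists because p can migrate between the unlocked task_rq(p) read and double_rq_lock(). That lock-then-revalidate pattern can be sketched in isolation as a userspace toy (toy_rq, lock_task_rq, and the simulated stale read are all invented for illustration, not kernel code):

```c
#include <stddef.h>

struct toy_rq { int cpu; };
struct toy_task { struct toy_rq *rq; };

/*
 * Simulated race: the first read(s) of the task's runqueue return a
 * stale pointer, as if the task migrated before we took the lock.
 */
static int stale_reads;

static struct toy_rq *read_task_rq(struct toy_task *p, struct toy_rq *stale)
{
	if (stale_reads > 0) {
		stale_reads--;
		return stale;	/* racy, pre-migration snapshot */
	}
	return p->rq;		/* value observed once we hold the lock */
}

static int retries;

/* Lock-and-revalidate: loop until the locked rq is still the task's rq. */
static struct toy_rq *lock_task_rq(struct toy_task *p, struct toy_rq *stale)
{
	struct toy_rq *rq;

again:
	rq = read_task_rq(p, stale);
	/* ... double_rq_lock(this_rq, rq) would go here ... */
	if (p->rq != rq) {
		/* ... unlock and retry: the task moved under us ... */
		retries++;
		goto again;
	}
	return rq;
}
```

The invariant the real code needs is the same: by the time the function returns, the runqueue it locked is provably the one the task is on.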
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Mon, 2011-01-03 at 16:29 -0500, Rik van Riel wrote:

From: Mike Galbraith <efa...@gmx.de>

Add a yield_to function to the scheduler code, allowing us to give enough of our timeslice to another thread to allow it to run and release whatever resource we need it to release. We may want to use this to provide a sys_yield_to system call one day.

Signed-off-by: Rik van Riel <r...@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>
Signed-off-by: Mike Galbraith <efa...@gmx.de>
---
Mike, want to change the above into a Signed-off-by: ? :) This code seems to work well.

Done.

-Mike
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
From: Mike Galbraith <efa...@gmx.de>

Add a yield_to function to the scheduler code, allowing us to give enough of our timeslice to another thread to allow it to run and release whatever resource we need it to release. We may want to use this to provide a sys_yield_to system call one day.

At least I want it.

Ruby has a GIL (giant interpreter lock), and a giant lock naturally forces an app into a cooperative multithreading model. Therefore it has a problem similar to yours. Python solved this issue by slowing down its lock mechanism (two pthread-cond wakeups on each GIL release), but I don't want that.

Also, if pthread_cond_signal() called sys_yield_to implicitly, we could avoid most of the Nehalem (and other point-to-point cache architecture) lock unfairness problems. (Probably creating pthread_condattr_setautoyield_np or a similar knob would be a good idea.)
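To make the GIL discussion concrete, a heavily simplified mutex-based GIL (loosely in the spirit of the interpreter locks being discussed; all names here are invented) shows where the unfairness comes from: there is no hand-off on release, so the releasing thread, whose cache still holds the lock word, usually wins the re-acquire race.

```c
#include <pthread.h>

/* A minimal GIL: one mutex, no hand-off and no FIFO ordering. */
static pthread_mutex_t gil = PTHREAD_MUTEX_INITIALIZER;
static long executed;	/* "bytecodes" executed under the GIL */

static void *interp_thread(void *arg)
{
	long steps = (long)arg;

	for (long i = 0; i < steps; i++) {
		pthread_mutex_lock(&gil);	/* no ticket: releaser may re-win */
		executed++;			/* run one "bytecode" */
		pthread_mutex_unlock(&gil);
		/*
		 * Between unlock and the next lock there is no forced
		 * yield; a cache-hot thread tends to re-acquire at once,
		 * starving the thread that was just signaled awake.
		 */
	}
	return NULL;
}

/* Run nthreads interpreter threads (max 8) for `steps` bytecodes each. */
static long run_interp(int nthreads, long steps)
{
	pthread_t t[8];

	executed = 0;
	for (int i = 0; i < nthreads; i++)
		pthread_create(&t[i], NULL, interp_thread, (void *)steps);
	for (int i = 0; i < nthreads; i++)
		pthread_join(t[i], NULL);
	return executed;
}
```

The total work is always correct; what this design cannot guarantee is *which* thread makes progress, which is the fairness gap the proposed yield-on-signal knob and ticket locks both aim at.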