Hi Stefan,

On Wed, Jun 10, 2015 at 1:14 PM, Stefan Andritoiu
<stefan.andrit...@gmail.com> wrote:
> Hello,
>
> I am currently working on a gang scheduling implementation for the
> bhyve VCPU-threads on FreeBSD 10.1.
> I have added a new field "int gang" to the thread structure to specify
> the gang it is part of (0 for no gang), and have modified the bhyve
> code to initialize this field when a VCPU is created. I will post
> these modifications in another message.
>
> When I start a Virtual Machine, during the guest's boot, IPIs are sent
> and received correctly between CPUs, but after a few seconds I get:
>     spin lock 0xffffffff8164c290 (smp rendezvous) held by 0xfffff8000296c000 (tid 100009) too long
>     panic: spin lock held too long
>
> If I limit the number of IPIs that are sent, I do not have this
> problem, which leads me to believe that (because of the constant
> context switching while the guest boots) the high number of IPIs sent
> starves the system.
>
> Does anyone know what is happening? And maybe know of a possible solution?
>

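In your patch, 'smp_rendezvous_cpus()' is being called from sched_choose()
with the TDQ locked.

There are a few code paths in ULE where it will want to lock two TDQs
at the same time (see tdq_lock_pair()). This has the potential to
cause a deadlock if the 2nd TDQ in tdq_lock_pair() is the one that was
locked before calling 'smp_rendezvous_cpus()': the CPU spinning on that
TDQ cannot service the rendezvous IPI, while the CPU that holds the TDQ
is waiting for every other CPU to do so.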
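
The scenario above can be sketched as a small userspace model. This is only an illustration under stated assumptions, not kernel code: `struct cpu_model`, `tdq_owner()` and `model_deadlocks()` are hypothetical names, each CPU holds at most one TDQ lock, and any CPU that is spinning on a TDQ is treated as unable to acknowledge an IPI (which should match tdq_lock_pair(), where the CPU already holds one spin lock while it waits for the second).

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CPUS 8
#define NO_CPU   (-1)

/* Hypothetical model of one CPU; none of these names exist in the kernel. */
struct cpu_model {
	int  holds_tdq;      /* TDQ lock currently held, or NO_CPU */
	int  spins_on_tdq;   /* TDQ lock being spun on, or NO_CPU */
	bool in_rendezvous;  /* waiting for all other CPUs to ack an IPI */
};

/* Return the CPU that holds the given TDQ lock, or NO_CPU if it is free. */
static int
tdq_owner(const struct cpu_model *cpus, int ncpu, int tdq)
{
	for (int i = 0; i < ncpu; i++)
		if (cpus[i].holds_tdq == tdq)
			return (i);
	return (NO_CPU);
}

/*
 * Repeatedly mark CPUs that can make progress.  A spinning CPU progresses
 * once the lock is free or its owner progresses; a CPU in a rendezvous
 * progresses once every other CPU can progress (and so ack the IPI).
 * Any CPU left unmarked at the fixed point is deadlocked.
 */
static bool
model_deadlocks(const struct cpu_model *cpus, int ncpu)
{
	bool progress[MAX_CPUS] = { false };
	bool changed = true;

	assert(ncpu > 0 && ncpu <= MAX_CPUS);
	while (changed) {
		changed = false;
		for (int i = 0; i < ncpu; i++) {
			bool ok = true;

			if (progress[i])
				continue;
			if (cpus[i].spins_on_tdq != NO_CPU) {
				int owner = tdq_owner(cpus, ncpu,
				    cpus[i].spins_on_tdq);
				ok = (owner == NO_CPU || progress[owner]);
			}
			if (ok && cpus[i].in_rendezvous) {
				for (int j = 0; j < ncpu; j++)
					if (j != i && !progress[j])
						ok = false;
			}
			if (ok) {
				progress[i] = true;
				changed = true;
			}
		}
	}
	for (int i = 0; i < ncpu; i++)
		if (!progress[i])
			return (true);
	return (false);
}
```

With CPU 0 described as `{0, NO_CPU, true}` (holds TDQ 0 and waits in the rendezvous) and CPU 1 as `{1, 0, false}` (holds TDQ 1 and spins on TDQ 0), the model reports a deadlock. If CPU 0 is instead `{NO_CPU, NO_CPU, true}`, meaning the TDQ was dropped before the rendezvous, it does not.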

To verify this theory, can you set the following sysctls and repeat the test?
$ sysctl kern.sched.steal_idle=0
$ sysctl kern.sched.balance=0

best
Neel

> Thank you,
> Stefan
>
>
> ======================================================================================
> I have added here the modifications to the sched_ule.c file and a
> brief explanation of it:
>
> In struct tdq, I have added two new fields:
>   - int scheduled_gang;
>     /* Set to a non-zero value if the respective CPU is required to
> schedule a thread belonging to a gang. The value of scheduled_gang
> is also the ID of the gang that we want scheduled. For now I have
> considered only one running guest, so the value is 0 or 1. */
>   - int gang_leader;
>     /* Set if the respective CPU is the one who has initialized gang
> scheduling. Zero otherwise. Not relevant to the final code and will be
> removed. Just for debugging purposes. */
>
> Created a new function "static void schedule_gang(void * arg)" that
> will be called by each processor when it receives an IPI from the gang
> leader:
>   - sets scheduled_gang = 1
>   - informs the system that it needs to reschedule. Not yet implemented
>
> In function "struct thread* tdq_choose (struct tdq * tdq)":
>     if (tdq->scheduled_gang) - checks to see if a thread belonging to
> a gang must be scheduled. If so, calls functions that check the runqs
> and return a gang thread. I have yet to implement these functions.
>
> In function "sched_choose()":
>    if (td->gang) - checks if the chosen thread is part of a gang. If
> so it signals all other CPUs to run function "schedule_gang(void *
> gang)".
>    if (tdq->scheduled_gang) - if scheduled_gang is set it means that
> the scheduler is called after the code in schedule_gang() has run,
> and bypasses sending IPIs to the other CPUs. Without this check,
> a CPU would receive an IPI; set scheduled_gang=1; the scheduler would
> be called and would choose a thread to run; that thread would be part
> of a gang; an IPI would be sent to all other CPUs. A constant
> back-and-forth of IPIs between the CPUs would be created.
>
> The CPU that initiates gang scheduling does not receive an IPI and
> does not even call the "schedule_gang(void * gang)" function. It
> continues scheduling the gang thread it selected, the one that
> started the gang scheduling process.
>
>
> ===================================================================
> --- sched_ule.c (revision 24)
> +++ sched_ule.c (revision 26)
> @@ -247,6 +247,9 @@
>   struct runq tdq_timeshare; /* timeshare run queue. */
>   struct runq tdq_idle; /* Queue of IDLE threads. */
>   char tdq_name[TDQ_NAME_LEN];
> +
> + int gang_leader;
> + int scheduled_gang;
>  #ifdef KTR
>   char tdq_loadname[TDQ_LOADNAME_LEN];
>  #endif
> @@ -1308,6 +1311,20 @@
>   struct thread *td;
>
>   TDQ_LOCK_ASSERT(tdq, MA_OWNED);
> +
> + /* Pick gang thread to run */
> + if (tdq->scheduled_gang){
> + /* basically the normal choosing of threads but with regards to scheduled_gang
> + td = runq_choose_gang(&tdq->tdq_realtime);
> + if (td != NULL)
> + return (td);
> +
> + td = runq_choose_from_gang(&tdq->tdq_timeshare, tdq->tdq_ridx);
> + if (td != NULL)
> + return (td);
> + */
> + }
> +
>   td = runq_choose(&tdq->tdq_realtime);
>   if (td != NULL)
>   return (td);
> @@ -2295,6 +2312,22 @@
>   return (load);
>  }
>
> +static void
> +schedule_gang(void * arg){
> + struct tdq *tdq;
> + struct tdq *from_tdq = arg;
> + tdq = TDQ_SELF();
> +
> + if(tdq == from_tdq){
> + /* Just for testing IPI. Code is never reached, and should never be*/
> + tdq->scheduled_gang = 1;
> +// printf("[schedule_gang] received IPI from himself\n");
> + }
> + else{
> + tdq->scheduled_gang = 1;
> +// printf("[schedule_gang] received on cpu: %s \n", tdq->tdq_name);
> + }
> +}
>  /*
>   * Choose the highest priority thread to run.  The thread is removed from
>   * the run-queue while running however the load remains.  For SMP we set
> @@ -2305,11 +2338,26 @@
>  {
>   struct thread *td;
>   struct tdq *tdq;
> + cpuset_t map;
>
>   tdq = TDQ_SELF();
>   TDQ_LOCK_ASSERT(tdq, MA_OWNED);
>   td = tdq_choose(tdq);
>   if (td) {
> + if(tdq->scheduled_gang){
> + /* Scheduler called after IPI
> + jump over rendezvous*/
> + tdq->scheduled_gang = 0;
> + }
> + else{
> + if(td->gang){
> + map = all_cpus;
> + CPU_CLR(curcpu, &map);
> +
> + smp_rendezvous_cpus(map, NULL, schedule_gang, NULL, tdq);
> + }
> + }
> +
>   tdq_runq_rem(tdq, td);
>   tdq->tdq_lowpri = td->td_priority;
>   return (td);
> _______________________________________________
> freebsd-virtualization@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-virtualization
> To unsubscribe, send any mail to 
> "freebsd-virtualization-unsubscr...@freebsd.org"