On Thu, Jun 11, 2015 at 3:02 AM, Neel Natu <neeln...@gmail.com> wrote:
> Hi Stefan,
>
> On Wed, Jun 10, 2015 at 1:14 PM, Stefan Andritoiu
> <stefan.andrit...@gmail.com> wrote:
>> Hello,
>>
>> I am currently working on a gang scheduling implementation for the
>> bhyve VCPU-threads on FreeBSD 10.1.
>> I have added a new field "int gang" to the thread structure to specify
>> the gang it is part of (0 for no gang), and have modified the bhyve
>> code to initialize this field when a VCPU is created. I will post
>> these modifications in another message.
>>
>> When I start a Virtual Machine, during the guest's boot, IPIs are sent
>> and received correctly between CPUs, but after a few seconds I get:
>>     spin lock 0xffffffff8164c290 (smp rendezvous) held by
>> 0xfffff8000296c000 (tid 100009) too long
>>     panic: spin lock held too long
>>
>> If I limit the number of IPIs that are sent, I do not have this
>> problem, which leads me to believe that (because of the constant
>> context-switching while the guest boots) the high number of IPIs sent
>> starves the system.
>>
>> Does anyone know what is happening? And maybe know of a possible solution?
>>
>
> In your patch 'smp_rendezvous()' is being called with the TDQ locked.
>
> There are a few code paths in ULE where it will want to lock two TDQs
> at the same time (see tdq_lock_pair()). This has the potential to
> cause a deadlock if the 2nd TDQ in tdq_lock_pair() is the one that was
> locked before calling 'smp_rendezvous()'.
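>
> A rough sketch of the interleaving I have in mind (assuming cpu0 is
> running your patched sched_choose() and cpu1 is in one of the ULE
> balancing paths; the exact path is an assumption):
>
> /*
>  * cpu0: sched_choose()
>  *   TDQ_LOCK(tdq0)              spin lock, interrupts disabled
>  *   smp_rendezvous_cpus(...)    waits for every target CPU to run
>  *                               the rendezvous action
>  *
>  * cpu1: tdq_idled()/sched_balance() -> tdq_lock_pair(tdq1, tdq0)
>  *   TDQ_LOCK(tdq1)              spin lock, interrupts disabled
>  *   TDQ_LOCK(tdq0)              spins; cpu0 still holds it
>  *
>  * cpu1 spins with interrupts disabled and never services the
>  * rendezvous IPI, cpu0 never returns from smp_rendezvous_cpus(), and
>  * a spin lock ends up held long enough to trigger the panic.
>  */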
>
> To verify this theory can you set the following sysctls and repeat the test?
> $ sysctl kern.sched.steal_idle=0
> $ sysctl kern.sched.rebalance=0
>
> best
> Neel
>

Hi Neel

I do not seem to have a kern.sched.rebalance variable. I do have
kern.sched.balance.
I have tested with both
  sysctl kern.sched.steal_idle=0
  sysctl kern.sched.balance=0
and
  sysctl kern.sched.steal_idle=0
  sysctl kern.sched.balance=1

Unfortunately, in both cases the result is the same as before:
  panic: spin lock held too long

best
Stefan

>> Thank you,
>> Stefan
>>
>>
>> ======================================================================================
>> I have added here the modifications to sched_ule.c and a brief
>> explanation of them:
>>
>> In struct tdq, I have added two new fields:
>>   - int scheduled_gang;
>>     /* Set to a non-zero value if the respective CPU is required to
>> schedule a thread belonging to a gang, the value of scheduled_gang
>> being the ID of the gang that we want scheduled. For now I have
>> considered only one running guest, so the value is 0 or 1. */
>>   - int gang_leader;
>>     /* Set if the respective CPU is the one that initialized gang
>> scheduling, zero otherwise. Not relevant to the final code and will be
>> removed; just for debugging purposes. */
>>
>> I created a new function "static void schedule_gang(void *arg)" that
>> will be called by each processor when it receives an IPI from the gang
>> leader:
>>   - sets scheduled_gang = 1
>>   - informs the system that it needs to reschedule (not yet
>> implemented; a possible direction is sketched below)
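>>
>> For the reschedule step, one possible direction (an untested sketch,
>> not part of the patch; whether it is safe to take the thread lock from
>> inside a rendezvous handler here would still need checking) is to mark
>> the interrupted thread so it passes through the scheduler at the next
>> preemption point:
>>
>> static void
>> schedule_gang(void *arg)
>> {
>> 	/* arg is the initiating tdq; unused in this sketch. */
>> 	struct tdq *tdq = TDQ_SELF();
>> 	struct thread *ctd = curthread;
>>
>> 	tdq->scheduled_gang = 1;	/* tdq_choose() will now prefer gang threads */
>>
>> 	/*
>> 	 * td_flags is protected by the thread lock, so take it before
>> 	 * setting TDF_NEEDRESCHED on the interrupted thread.
>> 	 */
>> 	thread_lock(ctd);
>> 	ctd->td_flags |= TDF_NEEDRESCHED;
>> 	thread_unlock(ctd);
>> }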
>>
>> In function "struct thread* tdq_choose (struct tdq * tdq)":
>>     if (tdq->scheduled_gang) - checks to see if a thread belonging to
>> a gang must be scheduled. If so, calls functions that check the runqs
>> and return a gang thread. I have yet to implement these functions.
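>>
>> A rough idea of what runq_choose_gang() could look like (an untested
>> sketch; it ignores the rq_status bitmap optimization that the real
>> runq_choose() uses and takes the gang ID as an extra argument, so the
>> call in the patch below would become
>> runq_choose_gang(&tdq->tdq_realtime, tdq->scheduled_gang)):
>>
>> static struct thread *
>> runq_choose_gang(struct runq *rq, int gang)
>> {
>> 	struct rqhead *rqh;
>> 	struct thread *td;
>> 	int pri;
>>
>> 	/*
>> 	 * Scan every queue in priority order and return the first
>> 	 * runnable thread that belongs to the requested gang.
>> 	 */
>> 	for (pri = 0; pri < RQ_NQS; pri++) {
>> 		rqh = &rq->rq_queues[pri];
>> 		TAILQ_FOREACH(td, rqh, td_runq) {
>> 			if (td->gang == gang)
>> 				return (td);
>> 		}
>> 	}
>> 	return (NULL);
>> }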
>>
>> In function "sched_choose()":
>>    if (td->gang) - checks if the chosen thread is part of a gang. If
>> so it signals all other CPUs to run function "schedule_gang(void *
>> gang)".
>>    if (tdq->scheduled_gang) - if scheduled_gang is set it means that
>> the scheduler is called after the the code in schedule_gang() has ran,
>> and bypasses sending IPIs to the other CPUs. If not for this checkup,
>> a CPU would receive a IPI; set scheduled_gang=1; the scheduler would
>> be called and would choose a thread to run; that thread would be part
>> of a gang; an IPI would be sent to all other CPUs. A constant
>> back-and-forth of IPIs between the CPUs would be created.
>>
>> The CPU that initializes gang scheduling does not receive an IPI and
>> does not even call the "schedule_gang(void *arg)" function. It
>> continues by scheduling the gang thread it selected, the one that
>> started the gang-scheduling process.
>>
>>
>> ===================================================================
>> --- sched_ule.c (revision 24)
>> +++ sched_ule.c (revision 26)
>> @@ -247,6 +247,9 @@
>>   struct runq tdq_timeshare; /* timeshare run queue. */
>>   struct runq tdq_idle; /* Queue of IDLE threads. */
>>   char tdq_name[TDQ_NAME_LEN];
>> +
>> + int gang_leader;
>> + int scheduled_gang;
>>  #ifdef KTR
>>   char tdq_loadname[TDQ_LOADNAME_LEN];
>>  #endif
>> @@ -1308,6 +1311,20 @@
>>   struct thread *td;
>>
>>   TDQ_LOCK_ASSERT(tdq, MA_OWNED);
>> +
>> + /* Pick a gang thread to run. */
>> + if (tdq->scheduled_gang){
>> + /* basically the normal choosing of threads, but restricted to the
>> + scheduled gang:
>> + td = runq_choose_gang(&tdq->tdq_realtime);
>> + if (td != NULL)
>> + return (td);
>> +
>> + td = runq_choose_from_gang(&tdq->tdq_timeshare, tdq->tdq_ridx);
>> + if (td != NULL)
>> + return (td);
>> + */
>> + }
>> +
>>   td = runq_choose(&tdq->tdq_realtime);
>>   if (td != NULL)
>>   return (td);
>> @@ -2295,6 +2312,22 @@
>>   return (load);
>>  }
>>
>> +static void
>> +schedule_gang(void * arg){
>> + struct tdq *tdq;
>> + struct tdq *from_tdq = arg;
>> + tdq = TDQ_SELF();
>> +
>> + if(tdq == from_tdq){
>> + /* Just for testing IPI. Code is never reached, and should never be*/
>> + tdq->scheduled_gang = 1;
>> +// printf("[schedule_gang] received IPI from himself\n");
>> + }
>> + else{
>> + tdq->scheduled_gang = 1;
>> +// printf("[schedule_gang] received on cpu: %s \n", tdq->tdq_name);
>> + }
>> +}
>>  /*
>>   * Choose the highest priority thread to run.  The thread is removed from
>>   * the run-queue while running however the load remains.  For SMP we set
>> @@ -2305,11 +2338,26 @@
>>  {
>>   struct thread *td;
>>   struct tdq *tdq;
>> + cpuset_t map;
>>
>>   tdq = TDQ_SELF();
>>   TDQ_LOCK_ASSERT(tdq, MA_OWNED);
>>   td = tdq_choose(tdq);
>>   if (td) {
>> + if(tdq->scheduled_gang){
>> + /* Scheduler called after IPI
>> + jump over rendezvous*/
>> + tdq->scheduled_gang = 0;
>> + }
>> + else{
>> + if(td->gang){
>> + map = all_cpus;
>> + CPU_CLR(curcpu, &map);
>> +
>> + smp_rendezvous_cpus(map, NULL, schedule_gang, NULL, tdq);
>> + }
>> + }
>> +
>>   tdq_runq_rem(tdq, td);
>>   tdq->tdq_lowpri = td->td_priority;
>>   return (td);
_______________________________________________
freebsd-virtualization@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-virtualization
To unsubscribe, send any mail to 
"freebsd-virtualization-unsubscr...@freebsd.org"
