Hey guys,
                                                          
While working to get our newly upgraded 6k-core machine online, we've
discovered a few possible locking issues in the stop_machine code that
we're trying to get sorted out.  We think the problems we're seeing stem
from an interaction between stop_cpus and stop_one_cpu.  The issue
presents as a deadlock, and only seems to show up intermittently.
                                                          
After quite a bit of debugging, we think we've narrowed the issue down
to the fact that stop_one_cpu does not respect many of the locks that
are taken in the stop_cpus code path.  For reference, the stop_cpus code
path takes stop_cpus_mutex, then stop_cpus_lock, and then each cpu's
stopper->lock.  stop_one_cpu seems to rely solely on stopper->lock.
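
To make the difference in lock coverage concrete, here is a rough
user-space C sketch.  It is only a toy model of the ordering described
above, not kernel code: toy_lock, model_stop_cpus_begin,
model_stop_cpus_end and model_stop_one_cpu are made-up names, NR_CPUS is
shrunk to 8, and a plain flag stands in for the real mutex and spinlocks.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8   /* small stand-in for the real core count */

/* Toy lock: a flag is enough to show which path looks at which lock. */
struct toy_lock { bool held; };

static struct toy_lock stop_cpus_mutex;
static struct toy_lock stop_cpus_lock;
static struct toy_lock stopper_lock[NR_CPUS];

static bool toy_trylock(struct toy_lock *l)
{
    if (l->held)
        return false;
    l->held = true;
    return true;
}

static void toy_unlock(struct toy_lock *l)
{
    l->held = false;
}

/* Modelled stop_cpus path: take the outer locks, in order, and keep them. */
static bool model_stop_cpus_begin(void)
{
    if (!toy_trylock(&stop_cpus_mutex))
        return false;
    if (!toy_trylock(&stop_cpus_lock)) {
        toy_unlock(&stop_cpus_mutex);
        return false;
    }
    return true;
}

/* Release the outer locks in reverse order. */
static void model_stop_cpus_end(void)
{
    assert(stop_cpus_mutex.held && stop_cpus_lock.held);
    toy_unlock(&stop_cpus_lock);
    toy_unlock(&stop_cpus_mutex);
}

/* Modelled stop_one_cpu path: only the target's stopper lock is involved. */
static bool model_stop_one_cpu(int cpu)
{
    if (!toy_trylock(&stopper_lock[cpu]))
        return false;
    /* ... work for this cpu would be queued here ... */
    toy_unlock(&stopper_lock[cpu]);
    return true;
}
```

In this model, a second model_stop_cpus_begin() fails while the first is
still in progress, but model_stop_one_cpu(5) goes straight through,
because it never looks at the two outer locks.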
                                                          
What appears to cause our deadlock is this: stop_cpus works its way down
to queue_stop_cpus_work, which tells each cpu's stopper task to wake up,
take its lock, and do its work.  As the loop that does this progresses,
the lowest numbered cpus complete their work and are allowed to go on
about their business.  The problem occurs when one of these lower
numbered cpus calls stop_one_cpu, targeting one of the higher numbered
cpus that the stop_cpus loop has not yet reached.  If this happens, that
higher numbered cpu's completion variable gets stomped on, and the
wait_for_completion in the stop_cpus code path never returns.
                                                          
A quick example: CPU 0 calls stop_cpus, which will hit all 6,000 cores.
CPU 50 completes its stopper work, and at some point in the near future
calls stop_one_cpu on CPU 5000.  This clobbers CPU 5000's pointer to the
cpu_stop_done struct set up in queue_stop_cpus_work, meaning that, once
CPU 5000 completes its work, it won't be able to decrement the nr_todo
for the correct cpu_stop_done struct, and CPU 0's wait_for_completion
will never return.
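
The example above can be replayed in a small user-space C sketch.  This
is a model of our hypothesis, not the kernel's actual data layout: it
assumes each cpu has a single pointer slot to a cpu_stop_done, and
toy_stopper, toy_queue, toy_run_stopper and toy_replay are made-up
names.  NR_CPUS is shrunk to 8 so it is easy to follow.

```c
#include <stddef.h>

#define NR_CPUS 8   /* small stand-in for the real core count */

/* Completion state for one stop request (named after the struct above). */
struct cpu_stop_done {
    int nr_todo;    /* stoppers that still have to report in */
};

/* Assumed per-cpu state: one slot pointing at the request to signal. */
struct toy_stopper {
    struct cpu_stop_done *done;
};

static struct toy_stopper stopper[NR_CPUS];

/* Queue work on a cpu; any pointer already in the slot is overwritten. */
static void toy_queue(int cpu, struct cpu_stop_done *done)
{
    stopper[cpu].done = done;
}

/* The cpu finishes its work and signals whatever its slot points at. */
static void toy_run_stopper(int cpu)
{
    struct cpu_stop_done *done = stopper[cpu].done;

    if (done)
        done->nr_todo--;
    stopper[cpu].done = NULL;
}

/*
 * Replay the scenario and return the nr_todo left on the stop_cpus
 * request.  Pass late_cpu < 0 to skip the interfering stop_one_cpu.
 */
static int toy_replay(int early_cpu, int late_cpu)
{
    struct cpu_stop_done all = { .nr_todo = NR_CPUS };
    struct cpu_stop_done one = { .nr_todo = 1 };
    int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        toy_queue(cpu, &all);       /* stop_cpus queues work everywhere */

    toy_run_stopper(early_cpu);     /* a low numbered cpu finishes early */
    if (late_cpu >= 0)
        toy_queue(late_cpu, &one);  /* its stop_one_cpu clobbers the slot */

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpu != early_cpu)
            toy_run_stopper(cpu);   /* everyone else finishes */

    return all.nr_todo;
}
```

With the interfering call, toy_replay(1, 6) leaves nr_todo at 1, so a
waiter on the stop_cpus request would never be woken.  Without it,
toy_replay(1, -1) brings nr_todo down to 0 as expected.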
                                                          
Again, much of this is semi-educated guesswork, put together from
examining lots of debug output in an attempt to spot the problem.  We're
fairly certain that we've pinned down our issue, but we'd like to ask
those who are more knowledgeable about these code paths to weigh in
here.
                                                          
We'd really appreciate any help that anyone can offer.  Thanks!
                                                          
- Alex
