On Wed, Apr 16, 2008 at 12:15:08PM -0700, Christoph Lameter wrote:
> On Wed, 16 Apr 2008, Robin Holt wrote:
>
> > On Wed, Apr 16, 2008 at 11:35:38AM -0700, Christoph Lameter wrote:
> > > On Wed, 16 Apr 2008, Robin Holt wrote:
> > >
> > > > I don't think this lock mechanism is completely working.  I have
> > > > gotten a few failures trying to dereference 0x100100 which appears to
> > > > be LIST_POISON1.
> > >
> > > How does xpmem unregistering of notifiers work?
> >
> > For the tests I have been running, we are waiting for the release
> > callout as part of exit.
>
> Some more details on the failure may be useful. AFAICT list_del[_rcu] is
> the culprit here and that is only used on release or unregister.

I think I have this understood now.  It happens quite quickly (within
10 minutes) on a 128 rank job with a small data set, run in a loop.

In these failing jobs, all the ranks are nearly symmetric.  There is a
certain part of each rank's address space that has access granted.  All
the ranks have included all the other ranks, including themselves, in
exactly the same layout at exactly the same virtual address.

Rank 3 has hit _release and is beginning to clean up, but has not yet
deleted the notifier from its list.

Rank 9 calls the xpmem_invalidate_page() callout.  That page was attached
by rank 3, so we call zap_page_range on rank 3, which then calls back into
xpmem's invalidate_range_start callout.

The rank 3 _release callout begins and deletes its notifier from the list.

Rank 9's call to rank 3's zap_page_range notifier returns and dereferences
LIST_POISON1.

I often confuse myself while trying to explain these, so please kick me
where the holes in the flow appear.  The console output from the simple
debugging stuff I put in is a bit overwhelming.

I am now trying to figure out which locks we hold as part of the zap
callout that should have prevented the _release callout.

Thanks,
Robin

_______________________________________________
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel
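
The interleaving described above (a walker invokes a notifier callout, the
entry is deleted while the callout runs, and the walker then reads the
poisoned next pointer) can be modeled in user space.  This is only a
single-threaded sketch under stated assumptions: the struct, the
release_during_callout knob, and every model_* name are hypothetical and are
not taken from xpmem or the mmu notifier code.  The only detail borrowed from
the report is the 0x100100 poison value, which the model compares against and
never dereferences.

```c
#include <assert.h>
#include <stddef.h>

/* Marker mirroring the LIST_POISON1 value from the report.
 * It is only compared against, never dereferenced. */
#define MODEL_POISON ((struct notifier *)0x100100)

/* Hypothetical notifier entry on a singly linked list. */
struct notifier {
    struct notifier *next;
    int release_during_callout; /* simulate _release racing with the walk */
};

/* Unlink n from the list and poison its next pointer, the way a
 * list deletion poisons the removed entry. */
static void model_list_del(struct notifier **head, struct notifier *n)
{
    struct notifier **pp = head;

    while (*pp != NULL && *pp != n)
        pp = &(*pp)->next;
    if (*pp == n) {
        *pp = n->next;
        n->next = MODEL_POISON;
    }
}

/* Stand-in for the invalidate_range_start callout.  If the knob is
 * set, the entry is removed while its own callout is running. */
static void model_callout(struct notifier **head, struct notifier *n)
{
    if (n->release_during_callout)
        model_list_del(head, n);
}

/* Walk the list: invoke the callout, then read n->next afterwards.
 * Returns 1 if the walker would follow the poison pointer, 0 if the
 * walk completes normally. */
static int model_walk(struct notifier **head)
{
    struct notifier *n = *head;

    while (n != NULL) {
        model_callout(head, n);
        if (n->next == MODEL_POISON)
            return 1; /* real code would dereference 0x100100 here */
        n = n->next;
    }
    return 0;
}

/* Build a three-entry list and optionally release the middle entry
 * in the middle of the walk. */
static int model_run(int release_mid_walk)
{
    struct notifier c = { NULL, 0 };
    struct notifier b = { &c, release_mid_walk };
    struct notifier a = { &b, 0 };
    struct notifier *head = &a;
    int hit = model_walk(&head);

    assert(head == &a); /* the first entry is never removed in this model */
    return hit;
}
```

Calling model_run(0) walks all three entries, while model_run(1) stops at the
entry that was deleted during its own callout.  The model says nothing about
which lock should exclude _release in the real code; it only shows why reading
the next pointer after the callout returns is unsafe once deletion can run
concurrently.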