On 2011-08-08, at 10:03 AM, chas williams - CONTRACTOR wrote:
> we have seen a few crashes that look like:
>
> [250696.381575] RIP: 0010:[<ffffffffa0a1f9e4>] [<ffffffffa0a1f9e4>]
> mdc_exit_request+0x74/0xb0 [mdc]
> ...
> [250696.381575] Call Trace:
> [250696.381575] [<ffffffffa0a25042>]
> mdc_intent_getattr_async_interpret+0x82/0x500 [mdc]
> [250696.381575] [<ffffffffa089efd0>] ptlrpc_check_set+0x200/0x1690 [ptlrpc]
> [250696.381575] [<ffffffffa08d3140>] ptlrpcd_check+0x110/0x250 [ptlrpc]
>
> and i sort of gather the problem arises from mdc_enter_request().
> it allocates an mdc_cache_waiter on the stack and inserts it into the
> wait list and then returns.
>
> int mdc_enter_request(struct client_obd *cli)
> ...
> struct mdc_cache_waiter mcw;
> ...
> list_add_tail(&mcw.mcw_entry, &cli->cl_cache_waiters);
> init_waitqueue_head(&mcw.mcw_waitq);
>
> later mdc_exit_request() finds this mcw by iterating the list.
> seeing as mcw was allocated on the stack, i dont think you can do this.
> mcw might have been reused by the time mdc_exit_request() gets around
> to removing it.

What version of Lustre is this?

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss