Thanks Phil, that appears to solve the problem! I tested it against both head and the orange branch and didn't see any of the infinite looping or client segfaults. I tested it without any of the other changes, so it looks like that patch alone resolves the issue.
Michael

On Fri, Jan 15, 2010 at 03:28:54PM -0500, Phil Carns wrote:
> Hi Michael,
>
> I just tried your test case on a clean trunk build here and was able to reproduce the pvfs2-client-core segfault 100% of the time on my box.
>
> The problem in a nutshell is that pvfs2-client-core was trying to cancel a small-io operation using logic that is only appropriate for a normal I/O operation, in turn causing some memory corruption.
>
> Can you try out the fix and see if it solves the problem for you? The patch is attached, or you can pull it from cvs trunk.
>
> You might want to try that change by itself (without the op purged change) first and go from there. Some of the other issues you ran into may have been an after-effect of the cancel problem.
>
> -Phil
>
> Michael Moore wrote:
> > On Thu, Jan 07, 2010 at 06:48:19PM -0500, Michael Moore wrote:
> >> On Thu, Jan 07, 2010 at 03:07:17PM -0600, Sam Lang wrote:
> >>> On Jan 7, 2010, at 2:29 PM, Michael Moore wrote:
> >>>> On Thu, Jan 07, 2010 at 12:30:00PM -0600, Sam Lang wrote:
> >>>>> On Jan 7, 2010, at 11:32 AM, Michael Moore wrote:
> >>>>>>
> >>>>>> In continuing to look at the 100% CPU usage (kernel loop) Randy had written about previously, I've narrowed the issue down a little. It appears related to cancellation of operations when a write() call is blocking and I/O has been retried.
> >>>>>>
> >>>>>> While on our cluster the retries were caused by congestion, I am re-creating the congestion by killing an I/O server. The test C program I'm using just loops around writes of 4k to a PVFS file. If, while the program is executing, I kill a PVFS I/O server, the write hangs (expectedly). About 30% of the time, when I try to kill the process doing the writing it spikes to 100% CPU usage and is not killable. Also, every time I try to kill the writing process, pvfs2-client-core segfaults with something similar to:
> >>>>>>
> >>>>>> [E 11:58:09.724121] PVFS2 client: signal 11, faulty address is 0x41ec, from 0x8050b51
> >>>>>> [E 11:58:09.725403] [bt] pvfs2-client-core [0x8050b51]
> >>>>>> [E 11:58:09.725427] [bt] pvfs2-client-core(main+0xe48) [0x8052498]
> >>>>>> [E 11:58:09.725436] [bt] /lib/libc.so.6(__libc_start_main+0xdc) [0x75ee9c]
> >>>>>> [E 11:58:09.725444] [bt] pvfs2-client-core [0x804a381]
> >>>>>> [E 11:58:09.740133] Child process with pid 2555 was killed by an uncaught signal 6
> >>>>>>
> >>>>>> In the cases when the CPU usage becomes 100% (and the process can't be terminated), the for() loop in PINT_client_io_cancel strangely segfaults during exactly iteration 31. The value of sm_p->u.io.context_count is in the hundreds, so there are a significant number of jobs left to cancel.
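The test C program Michael describes was sent as an attachment and is not reproduced in the thread. The following is only a minimal sketch of that kind of workload; the file path, open flags, and fill byte are assumptions, not taken from the actual attachment.

/* Sketch of a 4 KiB write-loop tester; path and flags are assumptions,
 * this is not the program attached to the thread. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/mnt/pvfs2/writetest.dat";
    char buf[BLOCK_SIZE];
    int fd;

    /* 63 == '?', the first-byte value logged by pvfs_bufmap_copy_from_user
     * later in the thread */
    memset(buf, '?', sizeof(buf));

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* Loop writing 4 KiB records; kill an I/O server while this runs,
     * then try to kill this process, to reproduce the hang. */
    for (;;)
    {
        if (write(fd, buf, sizeof(buf)) != sizeof(buf))
        {
            perror("write");
            break;
        }
    }

    close(fd);
    return 0;
}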
> >>>>>
> >>>>> Hi Michael,
> >>>>>
> >>>>> Are you guys using infiniband by chance? Do you have a stack trace with debugging symbols where the pvfs2-client-core segfault occurs? That might be useful for narrowing things down.
> >>>>>
> >>>> No infiniband in use, vanilla ethernet.
> >>>>
> >>>> Stack trace with symbols:
> >>>>
> >>>> Program received signal SIGSEGV, Segmentation fault.
> >>>> [Switching to Thread 0xb7fe06d0 (LWP 2069)]
> >>>> PINT_client_io_cancel (id=5594) at src/client/sysint/client-state-machine.c:524
> >>>> 524         if (cur_ctx->msg_send_in_progress)
> >>>> (gdb)
> >>>> (gdb)
> >>>> (gdb) bt
> >>>> #0  PINT_client_io_cancel (id=5594)
> >>>>     at src/client/sysint/client-state-machine.c:524
> >>>> #1  0x0804b926 in service_operation_cancellation (vfs_request=0xa073d98)
> >>>>     at src/apps/kernel/linux/pvfs2-client-core.c:407
> >>>> #2  0x0804f141 in handle_unexp_vfs_request (vfs_request=0xa073d98)
> >>>>     at src/apps/kernel/linux/pvfs2-client-core.c:2959
> >>>> #3  0x08050c7b in process_vfs_requests ()
> >>>>     at src/apps/kernel/linux/pvfs2-client-core.c:3148
> >>>> #4  0x08052498 in main (argc=12, argv=0xbfe0d224)
> >>>>     at src/apps/kernel/linux/pvfs2-client-core.c:3561
> >>>> (gdb)
> >
> > In addition to the above segfault, I'm also seeing one for vfs_request in process_vfs_requests. The vfs_request segfault is the one depicted in the attached logs. The segfault occurs the first time the function accesses vfs_request:
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > 0x08050cd7 in process_vfs_requests ()
> >     at src/apps/kernel/linux/pvfs2-client-core.c:3125
> > 3125        if (vfs_request->num_ops == 1 &&
> >
> > Again, these are orange branch CVS line numbers. As a side note, I get a configure error when trying to build the head version: "config.status: error: cannot find input file: src/io/bmi/bmi_zoid/module.mk.in"
> >
> > It's obvious that vfs_request is not a valid address, but I haven't figured out where it's getting mucked with or thrown away. Note, these logs are after making the op_state_purged(op) change recommended below. That change seems to have prevented the kernel infinite loop issue, as you indicated it would.
> >
> > I don't think I'm doing anything special from the application side, but I'm also attaching the application I'm using to create this issue.
> >
> > Thanks,
> > Michael
> >
> >>> What version of PVFS is this? I'm having trouble matching up the lines to anything interesting.
> >>> -sam
> >>>
> >> Sorry, Orange-Branch out of CVS.
> >>
> >> Michael
> >>
> >>>>>> The real issue is the 30% of the time when the process gets stuck in the kernel waiting for a downcall. With some additional debugging, the process's write() call is clearly stuck in the while(1) loop in wait_for_cancellation_downcall(). The function's assumption is that either the request will time out or it will be serviced after one iteration of the loop. However, in this situation neither occurs. The schedule_timeout() call immediately returns with a signal pending, but the op is never serviced, so it spins indefinitely.
> >>>>>
> >>>>> Those infinite looping conditionals have bugged me for a while now (there's one in wait_for_matching_downcall too). We should probably go through and systematically replace all of them. Let's try to fix your bug first, though.
> >>>>>
> >>>>> I'm actually surprised that the process goes to 100% cpu in this case. You're right that the op is not getting serviced, so the first if conditional won't break out of the while loop. But the schedule_timeout should only return non-zero if the task gets woken up, and it only gets woken up in purge_waiting_ops() when the pvfs2-client-core segfaults. And that should only happen once. One thing you might try is to change the first if conditional from:
> >>>>>
> >>>>>     if (op_state_serviced(op))
> >>>>>
> >>>>> to:
> >>>>>
> >>>>>     if (op_state_serviced(op) || op_state_purged(op))
> >>>>>
> >>>>> That will allow purged ops to exit the while loop. Could you share your debugging output and modified code?
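For reference, the loop under discussion has roughly the following shape. This is only a simplified sketch of wait_for_cancellation_downcall() with Sam's suggested check applied, not the actual pvfs2 kernel module source; the header name, the waitq field name, and the one-second timeout are assumptions.

#include <linux/jiffies.h>
#include <linux/sched.h>
#include <linux/wait.h>

#include "pvfs2-kernel.h"  /* assumed to provide pvfs2_kernel_op_t and the
                            * op_state_serviced()/op_state_purged() macros */

/* Simplified sketch only; the op layout and timeout are assumptions. */
static void wait_for_cancellation_downcall_sketch(pvfs2_kernel_op_t *op)
{
    DECLARE_WAITQUEUE(wait_entry, current);

    add_wait_queue(&op->waitq, &wait_entry);

    while (1)
    {
        set_current_state(TASK_INTERRUPTIBLE);

        /* Sam's suggested change: an op purged by purge_waiting_ops()
         * (i.e. the client-core died) exits just like a serviced one. */
        if (op_state_serviced(op) || op_state_purged(op))
            break;

        /* With a signal already pending, schedule_timeout() returns
         * immediately with time left on the clock; if the op is never
         * serviced or purged, the unpatched loop spins here forever. */
        if (!schedule_timeout(msecs_to_jiffies(1000)))
            break;  /* timed out */
    }

    set_current_state(TASK_RUNNING);
    remove_wait_queue(&op->waitq, &wait_entry);
}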
> >>>>
> >>>> No special code other than some gossip_debugs. The kernel debugging right before and including the cancel is:
> >>>>
> >>>> Jan 7 15:15:25 vm04 kernel: Alloced OP (d0c48020: 875 OP_FILE_IO)
> >>>> Jan 7 15:15:25 vm04 kernel: GET op d0c48020 -> buffer_index 0
> >>>> Jan 7 15:15:25 vm04 kernel: pvfs2_file_write: copy_to_user 1 nr_segs 1, offset: 3129344 total_size: 4096
> >>>> Jan 7 15:15:25 vm04 kernel: pvfs_bufmap_copy_iovec_from_user: index 0, size 4096
> >>>> Jan 7 15:15:25 vm04 kernel: First character (integer value) in pvfs_bufmap_copy_from_user: 63
> >>>> Jan 7 15:15:25 vm04 kernel: pvfs2: service_operation: pvfs2_file_write d0c48020
> >>>> Jan 7 15:15:25 vm04 kernel: pvfs2: operation posted by process: pvfs_rand_write, pid: 2076
> >>>> Jan 7 15:15:25 vm04 kernel: Releasing OP (d0c98060: 874)
> >>>> Jan 7 15:15:25 vm04 kernel: client-core: reading op tag 875 OP_FILE_IO
> >>>> Jan 7 15:15:55 vm04 kernel: *** operation interrupted by a signal (tag 875, op d0c48020)
> >>>> Jan 7 15:15:55 vm04 kernel: Interrupted: Removed op d0c48020 from htable_ops_in_progress
> >>>> Jan 7 15:15:55 vm04 kernel: pvfs2: service_operation pvfs2_file_write returning: -4 for d0c48020.
> >>>> Jan 7 15:15:55 vm04 kernel: pvfs2_cancel_op_in_progress called on tag 875
> >>>> Jan 7 15:15:55 vm04 kernel: Alloced OP (da5000a0: 876 OP_CANCEL)
> >>>> Jan 7 15:15:55 vm04 kernel: Attempting PVFS2 operation cancellation of tag 875
> >>>> Jan 7 15:15:55 vm04 kernel: pvfs2: service_operation: pvfs2_cancel da5000a0
> >>>> Jan 7 15:15:55 vm04 kernel: pvfs2: operation posted by process: pvfs_rand_write, pid: 2076
> >>>> Jan 7 15:15:55 vm04 kernel: *** operation didn't time out: da5000a0, jiffies_left: 20000
> >>>> Jan 7 15:15:57 vm04 last message repeated 3 times
> >>>> Jan 7 15:15:58 vm04 kernel: *** operation didn't time out: da5000a0, jiffies_left: 19999
> >>>> Jan 7 15:15:58 vm04 kernel: *** operation didn't time out: da5000a0, jiffies_left: 20000
> >>>> Jan 7 15:15:58 vm04 last message repeated 59 times
> >>>> ............
> >>>>
> >>>> The pvfs2-client debug log (from an earlier failure) proves that my earlier statement about segfaulting in the 31st iteration was false. This one goes into the 70's.
> >>>>
> >>>> [D 14:35:53.231333] cancel_op_in_progress called
> >>>> [D 14:35:53.231342] cancelling I/O req 0x88c7f98 from tag 1128
> >>>> [D 14:35:53.231357] PINT_client_io_cancel id 7350
> >>>> [D 14:35:53.231368] job iteration [0]
> >>>> [D 14:35:53.231377] [0] Posting cancellation of type: BMI Send (Request)
> >>>> [D 14:35:53.231386] [0] Posting cancellation of type: BMI Recv (Response)
> >>>> [D 14:35:53.231396] [0] Posting cancellation of type: FLOW
> >>>> [D 14:35:53.231431] [0] Posting cancellation of type: BMI Recv (Write Ack)
> >>>> [D 14:35:53.231444] job iteration [1]
> >>>> [D 14:35:53.231452] [1] Posting cancellation of type: BMI Send (Request)
> >>>> [D 14:35:53.231461] [1] Posting cancellation of type: BMI Recv (Response)
> >>>> [D 14:35:53.231469] [1] Posting cancellation of type: BMI Recv (Write Ack)
> >>>> [D 14:35:53.231502] job iteration [2]
> >>>> [D 14:35:53.231510] [2] Posting cancellation of type: BMI Send (Request)
> >>>> [D 14:35:53.231519] [2] Posting cancellation of type: BMI Recv (Response)
> >>>> [D 14:35:53.231528] [2] Posting cancellation of type: FLOW
> >>>> [D 14:35:53.231536] [2] Posting cancellation of type: BMI Recv (Write Ack)
> >>>> [D 14:35:53.231545] job iteration [3]
> >>>> [D 14:35:53.231553] [3] Posting cancellation of type: BMI Send (Request)
> >>>> [D 14:35:53.231562] [3] Posting cancellation of type: BMI Recv (Response)
> >>>> [D 14:35:53.231571] [3] Posting cancellation of type: FLOW
> >>>> [D 14:35:53.231579] [3] Posting cancellation of type: BMI Recv (Write Ack)
> >>>> [D 14:35:53.231588] job iteration [4]
> >>>> [D 14:35:53.231596] [4] Posting cancellation of type: BMI Send (Request)
> >>>> [D 14:35:53.231605] [4] Posting cancellation of type: BMI Recv (Response)
> >>>> [D 14:35:53.231614] [4] Posting cancellation of type: FLOW
> >>>> [D 14:35:53.231622] [4] Posting cancellation of type: BMI Recv (Write Ack)
> >>>> [D 14:35:53.231628] job iteration [5]
> >>>> [D 14:35:53.231666] [5] Posting cancellation of type: BMI Send (Request)
> >>>> [D 14:35:53.231675] [5] Posting cancellation of type: BMI Recv (Response)
> >>>> [D 14:35:53.231684] [5] Posting cancellation of type: FLOW
> >>>> [D 14:35:53.231692] [5] Posting cancellation of type: BMI Recv (Write Ack)
> >>>> [D 14:35:53.231701] job iteration [6]
> >>>> [D 14:35:53.231709] [6] Posting cancellation of type: BMI Send (Request)
> >>>> [D 14:35:53.231718] [6] Posting cancellation of type: BMI Recv (Response)
> >>>> [D 14:35:53.231727] [6] Posting cancellation of type: FLOW
> >>>> [D 14:35:53.231735] [6] Posting cancellation of type: BMI Recv (Write Ack)
> >>>> ............
> >>>> <snip>
> >>>> ............
> >>>> [D 14:35:53.235128] job iteration [69]
> >>>> [D 14:35:53.235136] [69] Posting cancellation of type: FLOW
> >>>> [D 14:35:53.235145] [69] Posting cancellation of type: BMI Recv (Write Ack)
> >>>> [D 14:35:53.235153] job iteration [70]
> >>>> [D 14:35:53.235161] job iteration [71]
> >>>> [D 14:35:53.235169] job iteration [72]
> >>>> [D 14:35:53.235177] job iteration [73]
> >>>> [E 14:35:53.235212] PVFS2 client: signal 11, faulty address is 0xb8c0c8, from 0xb26900
> >>>> [E 14:35:53.235810] [bt] /usr/local/pvfs/lib/libpvfs2.so(PINT_client_io_cancel+0xe0) [0xb26900]
> >>>> [E 14:35:53.235828] [bt] pvfs2-client-core [0x804b926]
> >>>> [E 14:35:53.235836] [bt] pvfs2-client-core [0x804f141]
> >>>> [E 14:35:53.235844] [bt] pvfs2-client-core [0x8050c7b]
> >>>> [E 14:35:53.235852] [bt] pvfs2-client-core(main+0xe48) [0x8052498]
> >>>>
> >>>> I'll give the pending_ops change a try and post the results.
> >>>>
> >>>> Michael
> >>>>
> >>>>> Thanks,
> >>>>>
> >>>>> -sam
> >>>>>
> >>>>>> Has anyone else seen the issue with client-core segfaulting on every cancel op? Should the kernel wait_for_cancellation_downcall() be changed to not allow indefinite looping?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Michael
>
> Index: src/client/sysint/client-state-machine.c
> ===================================================================
> RCS file: /projects/cvsroot/pvfs2-1/src/client/sysint/client-state-machine.c,v
> retrieving revision 1.103
> diff -a -u -p -r1.103 client-state-machine.c
> --- src/client/sysint/client-state-machine.c   7 Oct 2009 13:48:49 -0000   1.103
> +++ src/client/sysint/client-state-machine.c   15 Jan 2010 20:15:16 -0000
> @@ -483,6 +483,7 @@ PVFS_error PINT_client_io_cancel(PVFS_sy
>      PVFS_error ret = -PVFS_EINVAL;
>      PINT_smcb *smcb = NULL;
>      PINT_client_sm *sm_p = NULL;
> +    PINT_client_sm *sm_base_p = NULL;
>  
>      gossip_debug(GOSSIP_CLIENT_DEBUG,
>                   "PINT_client_io_cancel id %lld\n",lld(id));
> @@ -507,6 +508,18 @@ PVFS_error PINT_client_io_cancel(PVFS_sy
>      {
>          /* op already completed; nothing to cancel. */
>          return 0;
> +    }
> +
> +    /* We also don't cancel small I/O operations as posted by
> +     * sys-small-io.sm.  Check the corresponding flag.  We have
> +     * to jump to the base frame rather than the current frame for this
> +     * information because small-io may have pushed a msgpairarray.
> +     */
> +    sm_base_p = PINT_sm_frame(smcb, (-(smcb->frame_count -1)));
> +    if(sm_base_p->u.io.small_io)
> +    {
> +        gossip_debug(GOSSIP_CANCEL_DEBUG, "skipping cancellation of small I/O operation.\n");
> +        return(0);
> +    }
>  
>      /* if we fall to here, the I/O operation is still in flight */

_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
