On Jan 7, 2010, at 2:29 PM, Michael Moore wrote:

> On Thu, Jan 07, 2010 at 12:30:00PM -0600, Sam Lang wrote:
>> 
>> On Jan 7, 2010, at 11:32 AM, Michael Moore wrote:
>> 
>>> In continuing to look at the 100% CPU usage (kernel loop) Randy had 
>>> written about previously, I've narrowed the issue down a little. It 
>>> appears to be related to cancellation of operations when a write() 
>>> call is blocking and I/O has been retried. 
>>> 
>>> While on our cluster the retries were caused by congestion, I am 
>>> re-creating the congestion by killing an I/O server. The test C 
>>> program I'm using just loops around 4k writes to a PVFS file. If, 
>>> while the program is executing, I kill a PVFS I/O server, the write 
>>> hangs (expectedly). About 30% of the time, when I try to kill the 
>>> process doing the writing, it spikes to 100% CPU usage and becomes 
>>> unkillable. Also, every time I try to kill the writing process, 
>>> pvfs2-client-core segfaults with something similar to:
>>> 
>>> [E 11:58:09.724121] PVFS2 client: signal 11, faulty address is 0x41ec, 
>>> from 0x8050b51
>>> [E 11:58:09.725403] [bt] pvfs2-client-core [0x8050b51]
>>> [E 11:58:09.725427] [bt] pvfs2-client-core(main+0xe48) [0x8052498]
>>> [E 11:58:09.725436] [bt] /lib/libc.so.6(__libc_start_main+0xdc) 
>>> [0x75ee9c]
>>> [E 11:58:09.725444] [bt] pvfs2-client-core [0x804a381]
>>> [E 11:58:09.740133] Child process with pid 2555 was killed by an 
>>> uncaught signal 6
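>>>
>>> For reference, the test program is essentially the following (the 
>>> mount path, open() flags, and fill byte are just representative, not 
>>> necessarily exactly what we run):
>>>
>>>     #include <fcntl.h>
>>>     #include <stdio.h>
>>>     #include <string.h>
>>>     #include <unistd.h>
>>>
>>>     int main(int argc, char **argv)
>>>     {
>>>         char buf[4096];
>>>         const char *path = (argc > 1) ? argv[1] : "/mnt/pvfs2/testfile";
>>>         int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
>>>
>>>         if (fd < 0)
>>>         {
>>>             perror("open");
>>>             return 1;
>>>         }
>>>         memset(buf, '?', sizeof(buf));
>>>         for (;;)
>>>         {
>>>             /* each 4k write blocks in the kernel until the client core
>>>              * services it; killing an I/O server mid-run leaves it hung */
>>>             if (write(fd, buf, sizeof(buf)) < 0)
>>>             {
>>>                 perror("write");
>>>                 break;
>>>             }
>>>         }
>>>         close(fd);
>>>         return 1;
>>>     }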
>>> 
>>> In the cases when the CPU usage becomes 100% (and the process can't be 
>>> terminated), the for() loop in PINT_client_io_cancel strangely segfaults 
>>> on exactly iteration 31. The value of sm_p->u.io.context_count is in 
>>> the hundreds, so there are a significant number of jobs left to cancel.
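>>>
>>> For context, the cancel loop posts a cancellation for whatever is 
>>> still in flight on each I/O context. A minimal standalone sketch of 
>>> that shape (io_ctx_sketch and post_cancel() are stand-ins; only 
>>> msg_send_in_progress and context_count are names taken from the 
>>> actual source):
>>>
>>>     #include <stdio.h>
>>>
>>>     /* stand-in for the real per-datafile I/O context */
>>>     struct io_ctx_sketch
>>>     {
>>>         int msg_send_in_progress;  /* field named in the faulting line */
>>>         int msg_recv_in_progress;  /* remaining fields are assumed */
>>>         int flow_in_progress;
>>>         int write_ack_in_progress;
>>>     };
>>>
>>>     static void post_cancel(int i, const char *what)
>>>     {
>>>         printf("[%d] Posting cancellation of type: %s\n", i, what);
>>>     }
>>>
>>>     int main(void)
>>>     {
>>>         /* stands in for sm_p->u.io.context_count and its context array */
>>>         struct io_ctx_sketch contexts[4] = {
>>>             { 1, 1, 1, 1 }, { 1, 1, 0, 1 }, { 1, 1, 1, 1 }, { 0, 0, 1, 1 },
>>>         };
>>>         int i, context_count = 4;
>>>
>>>         for (i = 0; i < context_count; i++)
>>>         {
>>>             /* the real segfault hits while dereferencing the i-th context */
>>>             struct io_ctx_sketch *cur_ctx = &contexts[i];
>>>
>>>             if (cur_ctx->msg_send_in_progress)
>>>                 post_cancel(i, "BMI Send (Request)");
>>>             if (cur_ctx->msg_recv_in_progress)
>>>                 post_cancel(i, "BMI Recv (Response)");
>>>             if (cur_ctx->flow_in_progress)
>>>                 post_cancel(i, "FLOW");
>>>             if (cur_ctx->write_ack_in_progress)
>>>                 post_cancel(i, "BMI Recv (Write Ack)");
>>>         }
>>>         return 0;
>>>     }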
>> 
>> Hi Michael,
>> 
>> Are you guys using infiniband by chance?  Do you have a stack trace with 
>> debugging symbols where the pvfs2-client-core segfault occurs?  That might 
>> be useful for narrowing things down.
>> 
> No infiniband in use, just vanilla ethernet.
> 
> Stack trace with symbols:
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0xb7fe06d0 (LWP 2069)]
> PINT_client_io_cancel (id=5594) at 
> src/client/sysint/client-state-machine.c:524
> 524           if (cur_ctx->msg_send_in_progress)
> (gdb) 
> (gdb) 
> (gdb) bt
> #0  PINT_client_io_cancel (id=5594)
>    at src/client/sysint/client-state-machine.c:524
> #1  0x0804b926 in service_operation_cancellation (vfs_request=0xa073d98)
>    at src/apps/kernel/linux/pvfs2-client-core.c:407
> #2  0x0804f141 in handle_unexp_vfs_request (vfs_request=0xa073d98)
>    at src/apps/kernel/linux/pvfs2-client-core.c:2959
> #3  0x08050c7b in process_vfs_requests ()
>    at src/apps/kernel/linux/pvfs2-client-core.c:3148
> #4  0x08052498 in main (argc=12, argv=0xbfe0d224)
>    at src/apps/kernel/linux/pvfs2-client-core.c:3561
> (gdb) 

What version of PVFS is this?  I'm having trouble matching up the lines to 
anything interesting.
-sam

> 
> 
>>> The real issue is the 30% of the time when the process gets stuck in the 
>>> kernel waiting for a downcall. With some additional debugging, I can 
>>> see that the process's write() call is stuck in the while(1) loop in 
>>> wait_for_cancellation_downcall(). The function's assumption is that 
>>> either the request will time out or it will be serviced after one 
>>> iteration of the loop. However, in this situation neither occurs. The 
>>> schedule_timeout() call immediately returns with a signal pending, but 
>>> the op is never serviced, so it spins indefinitely.
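>>>
>>> Paraphrasing the loop from memory (locking and the exact timeout 
>>> bookkeeping left out, so don't read this as the literal source), the 
>>> control flow is roughly:
>>>
>>>     while (1)
>>>     {
>>>         set_current_state(TASK_INTERRUPTIBLE);
>>>
>>>         if (op_state_serviced(op))
>>>         {
>>>             /* never true here: client-core died, so the op was
>>>              * purged, not serviced */
>>>             ret = 0;
>>>             break;
>>>         }
>>>
>>>         /* with a signal already pending, schedule_timeout() returns
>>>          * immediately with nearly the full timeout left (hence the
>>>          * "didn't time out ... jiffies_left: 20000" messages), so we
>>>          * just go around again; the -ETIMEDOUT exit is never reached */
>>>         if (schedule_timeout(timeout_jiffies))  /* timeout_jiffies: stand-in */
>>>             continue;
>>>
>>>         ret = -ETIMEDOUT;
>>>         break;
>>>     }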
>> 
>> Those infinite looping conditionals have bugged me for a while now (there's 
>> one in wait_for_matching_downcall too).  We should probably go through and 
>> systematically replace all of them.  Let's try to fix your bug first, though.
>> 
>> I'm actually surprised that the process goes to 100% cpu in this case.  
>> You're right that the op is not getting serviced, so the first if 
>> conditional won't break out of the while loop.  But the schedule_timeout 
>> should only return non-zero if the task gets woken up, and it only gets 
>> woken up in purge_waiting_ops() when the pvfs2-client-core segfaults.  And 
>> that should only happen once.  One thing you might try is to change the 
>> first if conditional from:
>> 
>>        if (op_state_serviced(op))
>> 
>> to:
>> 
>>        if (op_state_serviced(op) || op_state_purged(op))
>> 
>> That will allow purged ops to exit the while loop.  Could you share your 
>> debugging output and modified code?
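>>
>> In context (paraphrasing the loop rather than quoting it), the top of 
>> the loop body would then look something like:
>>
>>         if (op_state_serviced(op) || op_state_purged(op))
>>         {
>>             /* a purged op (pvfs2-client-core died) now breaks out of
>>              * the loop instead of spinning */
>>             break;
>>         }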
> 
> No special code other than some gossip_debug calls. The kernel debug 
> output right before and including the cancel is:
> 
> Jan  7 15:15:25 vm04 kernel: Alloced OP (d0c48020: 875 OP_FILE_IO)
> Jan  7 15:15:25 vm04 kernel: GET op d0c48020 -> buffer_index 0
> Jan  7 15:15:25 vm04 kernel: pvfs2_file_write: copy_to_user 1 nr_segs 1, 
> offset: 3129344 total_size: 4096
> Jan  7 15:15:25 vm04 kernel: pvfs_bufmap_copy_iovec_from_user: index 0, 
> size 4096
> Jan  7 15:15:25 vm04 kernel: First character (integer value) in 
> pvfs_bufmap_copy_from_user: 63
> Jan  7 15:15:25 vm04 kernel: pvfs2: service_operation: pvfs2_file_write 
> d0c48020
> Jan  7 15:15:25 vm04 kernel: pvfs2: operation posted by process: 
> pvfs_rand_write, pid: 2076
> Jan  7 15:15:25 vm04 kernel: Releasing OP (d0c98060: 874)
> Jan  7 15:15:25 vm04 kernel: client-core: reading op tag 875 OP_FILE_IO
> Jan  7 15:15:55 vm04 kernel: *** operation interrupted by a signal (tag 
> 875, op d0c48020)
> Jan  7 15:15:55 vm04 kernel: Interrupted: Removed op d0c48020 from 
> htable_ops_in_progress
> Jan  7 15:15:55 vm04 kernel: pvfs2: service_operation pvfs2_file_write 
> returning: -4 for d0c48020.
> Jan  7 15:15:55 vm04 kernel: pvfs2_cancel_op_in_progress called on tag 
> 875
> Jan  7 15:15:55 vm04 kernel: Alloced OP (da5000a0: 876 OP_CANCEL)
> Jan  7 15:15:55 vm04 kernel: Attempting PVFS2 operation cancellation of 
> tag 875
> Jan  7 15:15:55 vm04 kernel: pvfs2: service_operation: pvfs2_cancel 
> da5000a0
> Jan  7 15:15:55 vm04 kernel: pvfs2: operation posted by process: 
> pvfs_rand_write, pid: 2076
> Jan  7 15:15:55 vm04 kernel: *** operation didn't time out: da5000a0, 
> jiffies_left: 20000
> Jan  7 15:15:57 vm04 last message repeated 3 times
> Jan  7 15:15:58 vm04 kernel: *** operation didn't time out: da5000a0, 
> jiffies_left: 19999
> Jan  7 15:15:58 vm04 kernel: *** operation didn't time out: da5000a0, 
> jiffies_left: 20000
> Jan  7 15:15:58 vm04 last message repeated 59 times
> ............
> 
> 
> The pvfs2-client debug log (from an earlier failure) proves my earlier 
> statement about segfaulting on the 31st iteration false. This one gets 
> into the 70s.
> 
> [D 14:35:53.231333] cancel_op_in_progress called
> [D 14:35:53.231342] cancelling I/O req 0x88c7f98 from tag 1128
> [D 14:35:53.231357] PINT_client_io_cancel id 7350
> [D 14:35:53.231368] job iteration [0]
> [D 14:35:53.231377] [0] Posting cancellation of type: BMI Send (Request)
> [D 14:35:53.231386] [0] Posting cancellation of type: BMI Recv 
> (Response)
> [D 14:35:53.231396] [0] Posting cancellation of type: FLOW
> [D 14:35:53.231431] [0] Posting cancellation of type: BMI Recv (Write 
> Ack)
> [D 14:35:53.231444] job iteration [1]
> [D 14:35:53.231452] [1] Posting cancellation of type: BMI Send (Request)
> [D 14:35:53.231461] [1] Posting cancellation of type: BMI Recv 
> (Response)
> [D 14:35:53.231469] [1] Posting cancellation of type: BMI Recv (Write 
> Ack)
> [D 14:35:53.231502] job iteration [2]
> [D 14:35:53.231510] [2] Posting cancellation of type: BMI Send (Request)
> [D 14:35:53.231519] [2] Posting cancellation of type: BMI Recv 
> (Response)
> [D 14:35:53.231528] [2] Posting cancellation of type: FLOW
> [D 14:35:53.231536] [2] Posting cancellation of type: BMI Recv (Write 
> Ack)
> [D 14:35:53.231545] job iteration [3]
> [D 14:35:53.231553] [3] Posting cancellation of type: BMI Send (Request)
> [D 14:35:53.231562] [3] Posting cancellation of type: BMI Recv 
> (Response)
> [D 14:35:53.231571] [3] Posting cancellation of type: FLOW
> [D 14:35:53.231579] [3] Posting cancellation of type: BMI Recv (Write 
> Ack)
> [D 14:35:53.231588] job iteration [4]
> [D 14:35:53.231596] [4] Posting cancellation of type: BMI Send (Request)
> [D 14:35:53.231605] [4] Posting cancellation of type: BMI Recv 
> (Response)
> [D 14:35:53.231614] [4] Posting cancellation of type: FLOW
> [D 14:35:53.231622] [4] Posting cancellation of type: BMI Recv (Write 
> Ack)
> [D 14:35:53.231628] job iteration [5]
> [D 14:35:53.231666] [5] Posting cancellation of type: BMI Send (Request)
> [D 14:35:53.231675] [5] Posting cancellation of type: BMI Recv 
> (Response)
> [D 14:35:53.231684] [5] Posting cancellation of type: FLOW
> [D 14:35:53.231692] [5] Posting cancellation of type: BMI Recv (Write 
> Ack)
> [D 14:35:53.231701] job iteration [6]
> [D 14:35:53.231709] [6] Posting cancellation of type: BMI Send (Request)
> [D 14:35:53.231718] [6] Posting cancellation of type: BMI Recv 
> (Response)
> [D 14:35:53.231727] [6] Posting cancellation of type: FLOW
> [D 14:35:53.231735] [6] Posting cancellation of type: BMI Recv (Write 
> Ack)
> ............
> <snip>
> ............
> [D 14:35:53.235128] job iteration [69]
> [D 14:35:53.235136] [69] Posting cancellation of type: FLOW
> [D 14:35:53.235145] [69] Posting cancellation of type: BMI Recv (Write 
> Ack)
> [D 14:35:53.235153] job iteration [70]
> [D 14:35:53.235161] job iteration [71]
> [D 14:35:53.235169] job iteration [72]
> [D 14:35:53.235177] job iteration [73]
> [E 14:35:53.235212] PVFS2 client: signal 11, faulty address is 0xb8c0c8, 
> from 0xb26900
> [E 14:35:53.235810] [bt] 
> /usr/local/pvfs/lib/libpvfs2.so(PINT_client_io_cancel+0xe0) [0xb26900]
> [E 14:35:53.235828] [bt] pvfs2-client-core [0x804b926]
> [E 14:35:53.235836] [bt] pvfs2-client-core [0x804f141]
> [E 14:35:53.235844] [bt] pvfs2-client-core [0x8050c7b]
> [E 14:35:53.235852] [bt] pvfs2-client-core(main+0xe48) [0x8052498]
> 
> 
> I'll give the op_state_purged change a try and post the results.
> 
> Michael
> 
>> 
>> Thanks,
>> 
>> -sam
>> 
>>> 
>>> Has anyone else seen the issue with client-core segfaulting on every 
>>> cancel op? Should the kernel wait_for_cancellation_downcall() be changed 
>>> to not allow indefinite looping? 
>>> 
>>> Thanks,
>>> Michael
>> 


_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
