It looks like some processes hung there because of memory issues in the kernel. The error messages from the very beginning of the log would be helpful.
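In case it helps with finding where the hung-task reports start, here is a rough sketch in Python (not part of gluster or of any kernel tooling) that lists each blocked task in the order it first appears in saved dmesg text, together with how many times it is reported. The function name and the sample text are made up for illustration; only the "INFO: task ... blocked for more than ... seconds" message format is taken from the quoted dmesg output.

```python
import re
from collections import Counter

# Matches the hung-task header line as it appears in the quoted dmesg output.
HUNG_TASK = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")


def summarize_hung_tasks(dmesg_text):
    """Return [(task_name, pid, report_count), ...] in order of first appearance."""
    counts = Counter()
    order = []
    for match in HUNG_TASK.finditer(dmesg_text):
        key = (match.group(1), int(match.group(2)))
        if key not in counts:
            order.append(key)  # remember which task was reported first
        counts[key] += 1
    return [(name, pid, counts[(name, pid)]) for name, pid in order]


if __name__ == "__main__":
    # Hypothetical sample; in practice, read the text saved from `dmesg`.
    sample = (
        "INFO: task glusterfs:8880 blocked for more than 120 seconds.\n"
        "INFO: task glusterfsd:9471 blocked for more than 120 seconds.\n"
        "INFO: task glusterfs:8880 blocked for more than 120 seconds.\n"
    )
    for name, pid, count in summarize_hung_tasks(sample):
        print(f"{name} (pid {pid}): {count} report(s)")
```

The first entry in the result is the earliest task reported as blocked, which is usually the place to start reading the surrounding log lines.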
Sent from my iPhone

On 2012-6-9, at 8:26 AM, Ling Ho <[email protected]> wrote:

> Hi Anand,
>
> ulimit -l running as root is 64.
>
> This dmesg output is from the second system.
>
> I don't see anything new on the first system other than what was there when
> the system booted.
> Do you want to see the whole dmesg output? Where should I post it? There are
> 1600 lines.
>
> ...
> ling
>
> INFO: task glusterfs:8880 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> glusterfs D 0000000000000000 0 8880 1 0x00000080
> ffff880614b75e48 0000000000000086 0000000000000000 ffff88010ed65d80
> 000000000000038b 000000000000038b ffff880614b75ee8 ffffffff814ef8f5
> ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
> Call Trace:
> [<ffffffff814ef8f5>] ? page_fault+0x25/0x30
> [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
> [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffff814ee6c2>] ? down_write+0x32/0x40
> [<ffffffff81141768>] sys_munmap+0x48/0x80
> [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> INFO: task glusterfs:8880 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> glusterfs D 0000000000000000 0 8880 1 0x00000080
> ffff880614b75e48 0000000000000086 0000000000000000 ffff88010ed65d80
> 000000000000038b 000000000000038b ffff880614b75ee8 ffffffff814ef8f5
> ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
> Call Trace:
> [<ffffffff814ef8f5>] ? page_fault+0x25/0x30
> [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
> [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffff814ee6c2>] ? down_write+0x32/0x40
> [<ffffffff81141768>] sys_munmap+0x48/0x80
> [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> INFO: task glusterfs:8880 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> glusterfs D 0000000000000009 0 8880 1 0x00000080
> ffff880614b75e08 0000000000000086 0000000000000000 ffff88062d638338
> ffff880c30ef88c0 ffffffff8120d34f ffff880614b75d98 ffff88061406f740
> ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
> Call Trace:
> [<ffffffff8120d34f>] ? security_inode_permission+0x1f/0x30
> [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
> [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffff814ee6c2>] ? down_write+0x32/0x40
> [<ffffffff81131ddc>] sys_mmap_pgoff+0x5c/0x2d0
> [<ffffffff81010469>] sys_mmap+0x29/0x30
> [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> INFO: task glusterfs:8880 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> glusterfs D 0000000000000009 0 8880 1 0x00000080
> ffff880614b75e08 0000000000000086 0000000000000000 ffff88062d638338
> ffff880c30ef88c0 ffffffff8120d34f ffff880614b75d98 ffff88061406f740
> ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
> Call Trace:
> [<ffffffff8120d34f>] ? security_inode_permission+0x1f/0x30
> [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
> [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffff814ee6c2>] ? down_write+0x32/0x40
> [<ffffffff81131ddc>] sys_mmap_pgoff+0x5c/0x2d0
> [<ffffffff81010469>] sys_mmap+0x29/0x30
> [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> INFO: task glusterfs:8880 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> glusterfs D 0000000000000003 0 8880 1 0x00000080
> ffff880614b75e08 0000000000000086 0000000000000000 ffff880630ab1ab8
> ffff880c30ef88c0 ffffffff8120d34f ffff880614b75d98 ffff88062df10480
> ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
> Call Trace:
> [<ffffffff8120d34f>] ? security_inode_permission+0x1f/0x30
> [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
> [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffff814ee6c2>] ? down_write+0x32/0x40
> [<ffffffff81131ddc>] sys_mmap_pgoff+0x5c/0x2d0
> [<ffffffff81010469>] sys_mmap+0x29/0x30
> [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> INFO: task glusterfsd:9471 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> glusterfsd D 0000000000000004 0 9471 1 0x00000080
> ffff8801077c3740 0000000000000082 0000000000000000 ffff8801077c36b8
> ffffffff8127f138 0000000000000000 0000000000000000 ffff8801077c36d8
> ffff8806146f4638 ffff8801077c3fd8 000000000000f4e8 ffff8806146f4638
> Call Trace:
> [<ffffffff8127f138>] ? swiotlb_dma_mapping_error+0x18/0x30
> [<ffffffff8127f138>] ? swiotlb_dma_mapping_error+0x18/0x30
> [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffffa019607a>] ? ixgbe_xmit_frame_ring+0x93a/0xfc0 [ixgbe]
> [<ffffffff814ef1f6>] rwsem_down_read_failed+0x26/0x30
> [<ffffffff81276e84>] call_rwsem_down_read_failed+0x14/0x30
> [<ffffffff814ee6f4>] ? down_read+0x24/0x30
> [<ffffffff81042bc7>] __do_page_fault+0x187/0x480
> [<ffffffff81430c38>] ? dev_queue_xmit+0x178/0x6b0
> [<ffffffff8146809c>] ? ip_finish_output+0x13c/0x310
> [<ffffffff814f253e>] do_page_fault+0x3e/0xa0
> [<ffffffff814ef8f5>] page_fault+0x25/0x30
> [<ffffffff81275a6d>] ? copy_user_generic_string+0x2d/0x40
> [<ffffffff81425655>] ? memcpy_toiovec+0x55/0x80
> [<ffffffff81426070>] skb_copy_datagram_iovec+0x60/0x2c0
> [<ffffffff8141ceac>] ? lock_sock_nested+0xac/0xc0
> [<ffffffff814ef5cb>] ? _spin_unlock_bh+0x1b/0x20
> [<ffffffff814722d5>] tcp_recvmsg+0xca5/0xe90
> [<ffffffff814925ea>] inet_recvmsg+0x5a/0x90
> [<ffffffff8141bff1>] sock_aio_read+0x181/0x190
> [<ffffffff810566a3>] ? perf_event_task_sched_out+0x33/0x80
> [<ffffffff8100988e>] ? __switch_to+0x26e/0x320
> [<ffffffff8141be70>] ? sock_aio_read+0x0/0x190
> [<ffffffff8117614b>] do_sync_readv_writev+0xfb/0x140
> [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
> [<ffffffff8120c1e6>] ? security_file_permission+0x16/0x20
> [<ffffffff811771df>] do_readv_writev+0xcf/0x1f0
> [<ffffffff811b9b50>] ? sys_epoll_wait+0xa0/0x300
> [<ffffffff814ecb0e>] ? thread_return+0x4e/0x760
> [<ffffffff81177513>] vfs_readv+0x43/0x60
> [<ffffffff81177641>] sys_readv+0x51/0xb0
> [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> INFO: task glusterfsd:9545 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> glusterfsd D 0000000000000006 0 9545 1 0x00000080
> ffff880c24a7bcf8 0000000000000082 0000000000000000 ffffffff8107c0a0
> ffff88066a0a7580 ffff880c30460000 0000000000000000 0000000000000000
> ffff88066a0a7b38 ffff880c24a7bfd8 000000000000f4e8 ffff88066a0a7b38
> Call Trace:
> [<ffffffff8107c0a0>] ? process_timeout+0x0/0x10
> [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffff8127f18c>] ? is_swiotlb_buffer+0x3c/0x50
> [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
> [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffff814ee6c2>] ? down_write+0x32/0x40
> [<ffffffffa0211b96>] ib_umem_release+0x76/0x110 [ib_core]
> [<ffffffffa0230d52>] mlx4_ib_dereg_mr+0x32/0x50 [mlx4_ib]
> [<ffffffffa020cd85>] ib_dereg_mr+0x35/0x50 [ib_core]
> [<ffffffffa041bc5b>] ib_uverbs_dereg_mr+0x7b/0xf0 [ib_uverbs]
> [<ffffffffa04194ef>] ib_uverbs_write+0xbf/0xe0 [ib_uverbs]
> [<ffffffff8117646d>] ? rw_verify_area+0x5d/0xc0
> [<ffffffff81176588>] vfs_write+0xb8/0x1a0
> [<ffffffff810d4692>] ? audit_syscall_entry+0x272/0x2a0
> [<ffffffff81176f91>] sys_write+0x51/0x90
> [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> INFO: task glusterfsd:9546 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> glusterfsd D 0000000000000004 0 9546 1 0x00000080
> ffff880c0634bcf0 0000000000000082 ffff880c0634bcb8 ffff880c0634bcb4
> 0000000000015f80 ffff88063fc24b00 ffff880655495f80 0000000000000400
> ffff880c2dccc5f8 ffff880c0634bfd8 000000000000f4e8 ffff880c2dccc5f8
> Call Trace:
> [<ffffffff810566a3>] ? perf_event_task_sched_out+0x33/0x80
> [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
> [<ffffffff814ef1f6>] rwsem_down_read_failed+0x26/0x30
> [<ffffffff814ecb0e>] ? thread_return+0x4e/0x760
> [<ffffffff81276e84>] call_rwsem_down_read_failed+0x14/0x30
> [<ffffffff814ee6f4>] ? down_read+0x24/0x30
> [<ffffffff81042bc7>] __do_page_fault+0x187/0x480
> [<ffffffffa0419e16>] ? ib_uverbs_event_read+0x1d6/0x240 [ib_uverbs]
> [<ffffffff814f253e>] do_page_fault+0x3e/0xa0
> [<ffffffff814ef8f5>] page_fault+0x25/0x30
> INFO: task glusterfsd:9553 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> glusterfsd D 000000000000000e 0 9553 1 0x00000080
> ffff8806e131dd98 0000000000000082 0000000000000000 ffff8806e131dd64
> ffff8806e131dd48 ffffffffa026dfb6 ffff8806e131dd28 ffffffff00000000
> ffff880c2f41c678 ffff8806e131dfd8 000000000000f4e8 ffff880c2f41c678
> Call Trace:
> [<ffffffffa026dfb6>] ? xfs_attr_get+0xb6/0xc0 [xfs]
> [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
> [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffff814ee6c2>] ? down_write+0x32/0x40
> [<ffffffff81136009>] sys_madvise+0x329/0x760
> [<ffffffff81195740>] ? mntput_no_expire+0x30/0x110
> [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> INFO: task glusterfs:8880 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> glusterfs D 0000000000000003 0 8880 1 0x00000080
> ffff880614b75e08 0000000000000086 0000000000000000 ffff880630ab1ab8
> ffff880c30ef88c0 ffffffff8120d34f ffff880614b75d98 ffff88062df10480
> ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
> Call Trace:
> [<ffffffff8120d34f>] ? security_inode_permission+0x1f/0x30
> [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
> [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffff814ee6c2>] ? down_write+0x32/0x40
> [<ffffffff81131ddc>] sys_mmap_pgoff+0x5c/0x2d0
> [<ffffffff81010469>] sys_mmap+0x29/0x30
> [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>
>
> On 06/08/2012 05:18 PM, Anand Avati wrote:
>>
>> Those are 4.x GB. Can you post dmesg output as well? Also, what's 'ulimit
>> -l' on your system?
>>
>> On Fri, Jun 8, 2012 at 4:41 PM, Ling Ho <[email protected]> wrote:
>>
>> This is the core file from the crash just now
>>
>> [root@psanaoss213 /]# ls -al core*
>> -rw------- 1 root root 4073594880 Jun 8 15:05 core.22682
>>
>> From yesterday:
>> [root@psanaoss214 /]# ls -al core*
>> -rw------- 1 root root 4362727424 Jun 8 00:58 core.13483
>> -rw------- 1 root root 4624773120 Jun 8 03:21 core.8792
>>
>>
>>
>> On 06/08/2012 04:34 PM, Anand Avati wrote:
>>>
>>> Is it possible the system was running low on memory? I see you have 48GB,
>>> but memory registration failure typically would be because the system limit
>>> on the number of pinnable pages in RAM was hit. Can you tell us the size of
>>> your core dump files after the crash?
>>>
>>> Avati
>>>
>>> On Fri, Jun 8, 2012 at 4:22 PM, Ling Ho <[email protected]> wrote:
>>> Hello,
>>>
>>> I have a brick that crashed twice today, and another different brick that
>>> crashed just a while ago.
>>>
>>> This is what I see in one of the brick logs:
>>>
>>> patchset: git://git.gluster.com/glusterfs.git
>>> patchset: git://git.gluster.com/glusterfs.git
>>> signal received: 6
>>> signal received: 6
>>> time of crash: 2012-06-08 15:05:11
>>> configuration details:
>>> argp 1
>>> backtrace 1
>>> dlfcn 1
>>> fdatasync 1
>>> libpthread 1
>>> llistxattr 1
>>> setfsid 1
>>> spinlock 1
>>> epoll.h 1
>>> xattr.h 1
>>> st_atim.tv_nsec 1
>>> package-string: glusterfs 3.2.6
>>> /lib64/libc.so.6[0x34bc032900]
>>> /lib64/libc.so.6(gsignal+0x35)[0x34bc032885]
>>> /lib64/libc.so.6(abort+0x175)[0x34bc034065]
>>> /lib64/libc.so.6[0x34bc06f977]
>>> /lib64/libc.so.6[0x34bc075296]
>>> /opt/glusterfs/3.2.6/lib64/libglusterfs.so.0(__gf_free+0x44)[0x7f1740ba25e4]
>>> /opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_transport_destroy+0x47)[0x7f1740956967]
>>> /opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_transport_unref+0x62)[0x7f1740956a32]
>>> /opt/glusterfs/3.2.6/lib64/glusterfs/3.2.6/rpc-transport/rdma.so(+0xc135)[0x7f173ca27135]
>>> /lib64/libpthread.so.0[0x34bc8077f1]
>>> /lib64/libc.so.6(clone+0x6d)[0x34bc0e5ccd]
>>> ---------
>>>
>>> And somewhere before these, there is also
>>> [2012-06-08 15:05:07.512604] E [rdma.c:198:rdma_new_post]
>>> 0-rpc-transport/rdma: memory registration failed
>>>
>>> I have 48GB of memory on the system:
>>>
>>> # free
>>>              total       used       free     shared    buffers     cached
>>> Mem:      49416716   34496648   14920068          0      31692   28209612
>>> -/+ buffers/cache:    6255344   43161372
>>> Swap:      4194296       1740    4192556
>>>
>>> # uname -a
>>> Linux psanaoss213 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST
>>> 2012 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> The server gluster version is 3.2.6-1. I have both rdma clients
>>> and tcp clients over a 10Gb/s network.
>>>
>>> Any suggestions on what I should look for?
>>>
>>> Is there a way to just restart the brick, and not glusterd on the server? I
>>> have 8 bricks on the server.
>>>
>>> Thanks,
>>> ...
>>> ling
>>>
>>>
>>> Here's the volume info:
>>>
>>> # gluster volume info
>>>
>>> Volume Name: ana12
>>> Type: Distribute
>>> Status: Started
>>> Number of Bricks: 40
>>> Transport-type: tcp,rdma
>>> Bricks:
>>> Brick1: psanaoss214:/brick1
>>> Brick2: psanaoss214:/brick2
>>> Brick3: psanaoss214:/brick3
>>> Brick4: psanaoss214:/brick4
>>> Brick5: psanaoss214:/brick5
>>> Brick6: psanaoss214:/brick6
>>> Brick7: psanaoss214:/brick7
>>> Brick8: psanaoss214:/brick8
>>> Brick9: psanaoss211:/brick1
>>> Brick10: psanaoss211:/brick2
>>> Brick11: psanaoss211:/brick3
>>> Brick12: psanaoss211:/brick4
>>> Brick13: psanaoss211:/brick5
>>> Brick14: psanaoss211:/brick6
>>> Brick15: psanaoss211:/brick7
>>> Brick16: psanaoss211:/brick8
>>> Brick17: psanaoss212:/brick1
>>> Brick18: psanaoss212:/brick2
>>> Brick19: psanaoss212:/brick3
>>> Brick20: psanaoss212:/brick4
>>> Brick21: psanaoss212:/brick5
>>> Brick22: psanaoss212:/brick6
>>> Brick23: psanaoss212:/brick7
>>> Brick24: psanaoss212:/brick8
>>> Brick25: psanaoss213:/brick1
>>> Brick26: psanaoss213:/brick2
>>> Brick27: psanaoss213:/brick3
>>> Brick28: psanaoss213:/brick4
>>> Brick29: psanaoss213:/brick5
>>> Brick30: psanaoss213:/brick6
>>> Brick31: psanaoss213:/brick7
>>> Brick32: psanaoss213:/brick8
>>> Brick33: psanaoss215:/brick1
>>> Brick34: psanaoss215:/brick2
>>> Brick35: psanaoss215:/brick4
>>> Brick36: psanaoss215:/brick5
>>> Brick37: psanaoss215:/brick7
>>> Brick38: psanaoss215:/brick8
>>> Brick39: psanaoss215:/brick3
>>> Brick40: psanaoss215:/brick6
>>> Options Reconfigured:
>>> performance.io-thread-count: 16
>>> performance.write-behind-window-size: 16MB
>>> performance.cache-size: 1GB
>>> nfs.disable: on
>>> performance.cache-refresh-timeout: 1
>>> network.ping-timeout: 42
>>> performance.cache-max-file-size: 1PB
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> [email protected]
>>> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
>>>
>>
>>
>
> _______________________________________________
> Gluster-users mailing list
> [email protected]
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
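On the pinnable-pages question raised in the quoted thread: 'ulimit -l' reports the locked-memory limit in KiB, so 64 there means 64 KiB for that shell. The limit a running brick process actually has may differ from what a root shell shows, so it can be worth reading it from /proc/<pid>/limits. Below is a minimal Python sketch of both checks. The function names are made up for illustration, and the sample line assumes the usual "Max locked memory" row layout of /proc/<pid>/limits.

```python
import resource


def memlock_limit_kib():
    """Return (soft, hard) RLIMIT_MEMLOCK of this process in KiB; None means unlimited."""
    def to_kib(value):
        return None if value == resource.RLIM_INFINITY else value // 1024

    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return to_kib(soft), to_kib(hard)


def parse_locked_memory(limits_text):
    """Extract the (soft, hard) 'Max locked memory' fields from /proc/<pid>/limits text.

    Values are returned as strings: a byte count or 'unlimited'.
    Returns None if the row is not present.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max locked memory"):
            fields = line.split()
            return fields[3], fields[4]
    return None


if __name__ == "__main__":
    print("this process (KiB):", memlock_limit_kib())
    # Hypothetical sample row; for a real brick, read /proc/<pid>/limits instead.
    sample = "Max locked memory         65536                65536                bytes\n"
    print("sample row:", parse_locked_memory(sample))
```

A value of 65536 bytes in that row corresponds to the 64 that 'ulimit -l' prints.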
