Hi Yuhao,

On Mon, Aug 6, 2018 at 6:57 AM Yuhao Zhang <[email protected]> wrote:
> Atin, that was my typo... I think it was glusterfsd, but not 100% sure. I
> will keep an eye on it the next time it happens.
>
> Thank you all for looking into this! I tried another transfer earlier
> today, but it didn't get the chance to reach the point where glusterfsd
> starts to fail before we needed to start the mission-critical production
> jobs. I am going to try again next week and report back to the thread.
>
> Btw, I took a look at syslog and found that this event happened right at
> the time the clients started to hang:
>
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142702] INFO: task glusteriotwr3:6895 blocked for more than 120 seconds.
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142760] Tainted: P O 4.4.0-116-generic #140-Ubuntu
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142805] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142863] glusteriotwr3 D ffff88102c18fb98 0 6895 1 0x00000000
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142872] ffff88102c18fb98 ffff88102c18fb68 ffff88085be0b800 ffff88103f314600
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142877] ffff88102c190000 ffff8805397fe0f4 ffff88103f314600 00000000ffffffff
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142882] ffff8805397fe0f8 ffff88102c18fbb0 ffffffff8184ae45 ffff8805397fe0f0
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142887] Call Trace:
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142899] [<ffffffff8184ae45>] schedule+0x35/0x80
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142904] [<ffffffff8184b0ee>] schedule_preempt_disabled+0xe/0x10
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142910] [<ffffffff8184cd29>] __mutex_lock_slowpath+0xb9/0x130
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142915] [<ffffffff8184cdbf>] mutex_lock+0x1f/0x30
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142922] [<ffffffff8122071d>] walk_component+0x21d/0x310
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142926] [<ffffffff81221ba1>] link_path_walk+0x191/0x5c0
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142931] [<ffffffff8121ff3b>] ? path_init+0x1eb/0x3c0
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142935] [<ffffffff812220cc>] path_lookupat+0x7c/0x110
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142940] [<ffffffff8123a828>] ? __vfs_setxattr_noperm+0x128/0x1a0
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142944] [<ffffffff81223d01>] filename_lookup+0xb1/0x180
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142948] [<ffffffff8123aadb>] ? setxattr+0x18b/0x200
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142955] [<ffffffff811f2109>] ? kmem_cache_alloc+0x189/0x1f0
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142960] [<ffffffff81223906>] ? getname_flags+0x56/0x1f0
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142964] [<ffffffff81223ea6>] user_path_at_empty+0x36/0x40
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142970] [<ffffffff81218d76>] vfs_fstatat+0x66/0xc0
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142975] [<ffffffff81219331>] SYSC_newlstat+0x31/0x60
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142980] [<ffffffff8121dade>] ? path_put+0x1e/0x30
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142984] [<ffffffff8123ac0f>] ? path_setxattr+0xbf/0xe0
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142989] [<ffffffff8121946e>] SyS_newlstat+0xe/0x10
> Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142996] [<ffffffff8184efc8>] entry_SYSCALL_64_fastpath+0x1c/0xbb

Which version of ZFS are you using?
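If you don't remember offhand, any of these should report it (a sketch assuming the stock Ubuntu 16.04 ZFS-on-Linux packages; adjust if you built from source):

    # version of the loaded zfs kernel module
    cat /sys/module/zfs/version

    # or via modinfo
    modinfo zfs | grep -iw version

    # or the packaged userland tools
    dpkg -l zfsutils-linux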
It seems like a hang inside ZFS rather than in Gluster itself: the blocked thread (glusteriotwr3, one of the brick's I/O workers) is in the middle of an lstat() and is waiting on a directory inode mutex during the path walk, so whatever holds that mutex is stuck below Gluster, in the local filesystem.

Xavi

> > On Aug 5, 2018, at 08:39, Atin Mukherjee <[email protected]> wrote:
> >
> > On Sun, 5 Aug 2018 at 13:29, Yuhao Zhang <[email protected]> wrote:
> >
> >> Sorry, what I meant was, if I start the transfer now and get glusterd
> >> into zombie status,
> >
> > glusterd or glusterfsd?
> >
> >> it's unlikely that I can fully recover the server without a reboot.
> >>
> >> On Aug 5, 2018, at 02:55, Raghavendra Gowdappa <[email protected]> wrote:
> >>
> >> On Sun, Aug 5, 2018 at 1:22 PM, Yuhao Zhang <[email protected]> wrote:
> >>
> >>> This is a semi-production server and I can't bring it down right now.
> >>> I will try to get the monitoring output when I get a chance.
> >>
> >> Collecting top output doesn't require bringing down the servers.
> >>
> >>> As I recall, the high-CPU processes were the brick daemons (glusterfsd),
> >>> and htop showed them in status D. However, I saw zero zpool I/O, as the
> >>> clients were all hanging.
> >>>
> >>> On Aug 5, 2018, at 02:38, Raghavendra Gowdappa <[email protected]> wrote:
> >>>
> >>> On Sun, Aug 5, 2018 at 12:44 PM, Yuhao Zhang <[email protected]> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I am running into a situation where heavy writes send the Gluster server
> >>>> into a zombie state with many high-CPU processes and all clients hanging.
> >>>> It is almost 100% reproducible on my machine. Hope someone can help.
> >>>
> >>> Can you give us the output of monitoring these high-CPU processes,
> >>> captured while your tests are running?
> >>>
> >>> - MON_INTERVAL=10  # can be increased for very long runs
> >>> - top -bd $MON_INTERVAL > /tmp/top_proc.${HOSTNAME}.txt  # CPU utilization by process
> >>> - top -bHd $MON_INTERVAL > /tmp/top_thr.${HOSTNAME}.txt  # CPU utilization by thread
> >>>
> >>>> I started to observe this issue when running rsync to copy files from
> >>>> another server, and I thought it might be because Gluster doesn't like
> >>>> rsync's delta transfer with its many small writes. However, I was able to
> >>>> reproduce it with "rsync --whole-file --inplace", and even with cp or scp.
> >>>> It usually appears a few hours after the transfer starts, but sometimes
> >>>> happens within minutes.
> >>>>
> >>>> Since this is a single-node Gluster distributed volume, I tried
> >>>> transferring files directly onto the server, bypassing the Gluster
> >>>> clients, but that still caused the same issue.
> >>>>
> >>>> It is running on top of a ZFS RAIDZ2 dataset. I attached the volume
> >>>> options, along with the statedump generated when my clients hung.
> >>>>
> >>>> - Ubuntu 16.04 x86_64 / 4.4.0-116-generic
> >>>> - GlusterFS 3.12.8
> >>>>
> >>>> Thank you,
> >>>> Yuhao
>
> --
> --Atin
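P.S. If you can catch the machine in that state again, the kernel stacks of the D-state threads would tell us exactly where in ZFS they are stuck. Two ways to grab them (a sketch; it assumes root on the brick server, and the SysRq variant needs kernel.sysrq to permit it):

    # dump the stacks of all uninterruptible (D-state) tasks into the kernel log
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 200

    # or read the stack of one specific blocked thread, e.g. the
    # glusteriotwr3 TID (6895) from the hung-task message above
    cat /proc/6895/stack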
_______________________________________________
Gluster-users mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-users
