Hi Ian, Kurtis,

Thanks for the reply. We are fixing the issue. But the point I wanted to raise here is that a thread appears to be keeping the Go process in the defunct state.

My kernel version is Linux version 4.14.175-1.nutanix.20200709.el7.x86_64 (dev@ca4b0551898c) (gcc version 7.3.1 20180303 (Red Hat 7.3.1-5) (GCC)) #1 SMP Fri Jul 10 02:17:54 UTC 2020

Thanks & Regards,
Uday Kiran

On Thursday, September 10, 2020 at 6:42:06 PM UTC-7 Ian Lance Taylor wrote:

> On Thu, Sep 10, 2020 at 5:09 PM Kurtis Rader <kra...@skepticism.us> wrote:
> >
> > A defunct process is a process that has terminated but whose parent process has not called wait() or one of its variants. I don't know why lsof still reports open files. It shouldn't, since a dead process should have its resources, such as its file descriptor table, freed by the kernel even if the parent hasn't called wait(). You didn't tell us the details of the OS you're using, so I would simply assume it's a quirk of your OS. It might be more productive to look into why your program is panicking at map_faststr.go:275. A likely explanation is that you have a race in your program that is causing it to attempt to mutate a map concurrently, or you're trying to insert into a nil map.
>
> That's a good point. What OS are you using? I don't think you said.
>
> Ian
>
> > On Thu, Sep 10, 2020 at 4:43 PM Uday Kiran Jonnala <juday...@gmail.com> wrote:
> >>
> >> Hi Ian,
> >>
> >> Again, thanks for the reply. The problem here is that we see the Go process in the defunct state, and the parent process certainly did not get SIGCHLD. Looking deeper, I see a thread in futex_wait_queue_me. If we are just getting a stale stack trace and the Go process was actually killed, why would I still see the associated fds, with the fd table intact (see the lsof information)?
> >>
> >> The process that panicked and is in the defunct state is <87548>. Checking the threads in it:
> >>
> >> bash-4.2# cat /proc/87548/status
> >> Name: replicator
> >> State: Z (zombie)
> >>
> >> bash-4.2# ls -Fl /proc/87548/task/87561/fd | grep 606649
> >> l-wx------. 1 root root 64 Aug 25 10:59 1 -> pipe:[606649]
> >> l-wx------. 1 root root 64 Aug 25 10:59 2 -> pipe:[606649]
> >>
> >> Listing the threads
> >>
> >> bash-4.2# ps -aefT | grep 87548
> >> root 87548 87548 87507 0 Aug23 ? 00:00:00 [replicator] <defunct>
> >> root 87548 87561 87507 0 Aug23 ? 00:00:00 [replicator] <defunct>
> >> root 112448 112448 42566 0 17:13 pts/0 00:00:00 grep 87548
> >>
> >> bash-4.2# lsof | grep 606649
> >> replicato 87548 87561 root 1w FIFO 0,11 0t0 606649 pipe
> >> replicato 87548 87561 root 2w FIFO 0,11 0t0 606649 pipe
> >>
> >> Why does lsof show the entry for the FIFO file of this process?
> >>
> >> So I feel we have a scenario where the thread sleeping in futex_wait_queue_me is not cleaned up during panic(): the main thread has exited, but the detached thread waiting in futex_wait_queue_me is still present.
> >>
> >> The main issue is that I am not able to reproduce this, since this Go process is very big.
> >>
> >> Is there any way to verify this or take it further?
> >>
> >> Thanks & Regards,
> >> Uday Kiran
> >> On Monday, September 7, 2020 at 12:05:05 PM UTC-7 Ian Lance Taylor wrote:
> >>>
> >>> On Mon, Sep 7, 2020 at 12:03 AM Uday Kiran Jonnala <juday...@gmail.com> wrote:
> >>> >
> >>> > Thanks for the reply. I get the point on zombies, but I do not think the issue here is the parent not reaping the child. It seems the Go process has not finished executing some internal threads (waiting on some futex), which causes SIGCHLD not to be sent to the parent.
> >>> >
> >>> > The Go process named <replicator> hit a panic and I see it went into the zombie state
> >>> >
> >>> > $ ps -ef | grep replicator
> >>> > root 87548 87507 0 Aug23 ? 00:00:00 [replicator] <defunct>
> >>> >
> >>> > Now looking at the tasks within the process
> >>> >
> >>> > I see the stack trace of the threads within the process still stuck on the following
> >>> >
> >>> > bash-4.2# cat /proc/87548/task/87561/stack
> >>> > [<ffffffffbb114714>] futex_wait_queue_me+0xc4/0x120
> >>> > [<ffffffffbb11520a>] futex_wait+0x10a/0x250
> >>> > [<ffffffffbb1182ce>] do_futex+0x35e/0x5b0
> >>> > [<ffffffffbb11865b>] SyS_futex+0x13b/0x180
> >>> > [<ffffffffbb003c09>] do_syscall_64+0x79/0x1b0
> >>> > [<ffffffffbba00081>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> >>> > [<ffffffffffffffff>] 0xffffffffffffffff
> >>> >
> >>> > From the above example, if we create some internal threads and the main thread has exited due to a panic, leaving some detached threads behind, the process will be in the zombie state until the threads within the process complete.
> >>> >
> >>> > It appears some runaway-thread hung-state scenario is causing this. I am not able to reproduce it with an explicit panic in the main goroutine while some other goroutine is still executing.
> >>> >
> >>> > Does the above stack trace sound familiar with respect to the internal threads of the Go runtime?
> >>>
> >>> If the process is defunct, then none of the thread stacks matter.
> >>> They are just where the thread happened to be when the process exited.
> >>>
> >>> What is the real problem you are seeing?
> >>>
> >>> Ian
> >>>
> >>> > On Thursday, August 27, 2020 at 1:43:39 PM UTC-7 Ian Lance Taylor wrote:
> >>> >>
> >>> >> On Thu, Aug 27, 2020 at 10:01 AM Uday Kiran Jonnala <juday...@gmail.com> wrote:
> >>> >> >
> >>> >> > I have a zombie-process situation with Go.
> >>> >> >
> >>> >> > A process (in this case, replicator) has many goroutines internally.
> >>> >> >
> >>> >> > We hit a panic() and I see the replicator process is in the zombie state
> >>> >> >
> >>> >> > <<>>>:~$ ps -ef | grep replicator
> >>> >> >
> >>> >> > root 87548 87507 0 Aug23 ? 00:00:00 [replicator] <defunct>
> >>> >> >
> >>> >> > The main goroutine (or its supporting P) exited, but the panic left the other P thread still present in a blocked state (the main P could be 87548, and the supporting P thread 87561 is still there)
> >>> >> >
> >>> >> > bash-4.2# ls -Fl /proc/87548/task/87561/fd | grep 606649
> >>> >> > l-wx------. 1 root root 64 Aug 25 10:59 1 -> pipe:[606649]
> >>> >> > l-wx------. 1 root root 64 Aug 25 10:59 2 -> pipe:[606649]
> >>> >> >
> >>> >> > Stack trace
> >>> >> >
> >>> >> > bash-4.2# cat /proc/87548/task/87561/stack
> >>> >> > [<ffffffffbb114714>] futex_wait_queue_me+0xc4/0x120
> >>> >> > [<ffffffffbb11520a>] futex_wait+0x10a/0x250
> >>> >> > [<ffffffffbb1182ce>] do_futex+0x35e/0x5b0
> >>> >> > [<ffffffffbb11865b>] SyS_futex+0x13b/0x180
> >>> >> > [<ffffffffbb003c09>] do_syscall_64+0x79/0x1b0
> >>> >> > [<ffffffffbba00081>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> >>> >> > [<ffffffffffffffff>] 0xffffffffffffffff
> >>> >> >
> >>> >> > We have a panic internally from the main goroutine
> >>> >> >
> >>> >> > fatal error: concurrent map writes
> >>> >> >
> >>> >> > goroutine 666359 [running]:
> >>> >> > runtime.throw(0x101d6ae, 0x15)
> >>> >> >     /home/ll/ntnx/toolchain-builds/78ae837ba07c8ef8f0ea782407d8d4626815552b.x86_64/go/src/runtime/panic.go:608 +0x72 fp=0xc00374b6f0 sp=0xc00374b6c0 pc=0x42da62
> >>> >> > runtime.mapassign_faststr(0xdb71c0, 0xc00023f5f0, 0xc000aca990, 0x83, 0xc0009d03c8)
> >>> >> >     /home/ll/ntnx/toolchain-builds/78ae837ba07c8ef8f0ea782407d8d4626815552b.x86_64/go/src/runtime/map_faststr.go:275 +0x3bf fp=0xc00374b758 sp=0xc00374b6f0 pc=0x41527f
> >>> >> > github.eng.nutanix.com/xyz/abc/metadata.UpdateRecvInProgressFlag(0xc000aca990, 0x83, 0x0)
> >>> >> >
> >>> >> > .......
> >>> >> >
> >>> >> > goroutine 665516 [chan receive, 2 minutes]:
> >>> >> > zeus.(*Leadership).LeaderValue.func1(0xc003d5c120, 0x0, 0xc002e906c0, 0x52, 0xc00302ec60, 0x29)
> >>> >> >     /home/ll/ntnx/main/build/.go/src/zeus/leadership.go:244 +0x34
> >>> >> > created by zeus.(*Leadership).LeaderValue
> >>> >> >     /home/ll/ntnx/main/build/.go/src/zeus/leadership.go:243 +0x277
> >>> >> > 2020-08-03 00:35:04 rolled over log file
> >>> >> > ERROR: logging before flag.Parse: I0803 00:35:04.426906 196123 dataset.go:26] initialize zfs linking
> >>> >> > ERROR: logging before flag.Parse: I0803 00:35:04.433296 196123 dataset.go:34] completed zfs linking successfully
> >>> >> > I0803 00:35:04.433447 196123 main.go:86] Gflags passed NodeUuid: c238e584-0eeb-48bd-b299-2a25b13602f1, External Ip: 10.15.96.163
> >>> >> > I0803 00:35:04.433460 196123 main.go:99] Component name using for this process : abc-c238e584-0eeb-48bd-b299-2a25b13602f1
> >>> >> > I0803 00:35:04.433467 196123 main.go:120] Trying to initialize DB
> >>> >> >
> >>> >> > If there is a panic() from the main P thread, as I understand it we exit() and clean up all P threads of the process.
> >>> >> >
> >>> >> > Are we hitting the following scenario? I did not look into the M-P-G implementation in detail.
> >>> >> >
> >>> >> > Example:
> >>> >> >
> >>> >> > #include <stdio.h>
> >>> >> > #include <pthread.h>
> >>> >> > #include <unistd.h>
> >>> >> > #include <stdlib.h>
> >>> >> >
> >>> >> > void *thread_function(void *args)
> >>> >> > {
> >>> >> >     printf("The is new thread! Sleep 20 seconds...\n");
> >>> >> >     sleep(100);
> >>> >> >     printf("Exit from thread\n");
> >>> >> >     pthread_exit(0);
> >>> >> > }
> >>> >> >
> >>> >> > int main(int argc, char **argv)
> >>> >> > {
> >>> >> >     pthread_t thrd;
> >>> >> >     pthread_attr_t attr;
> >>> >> >     int res = 0;
> >>> >> >     res = pthread_attr_init(&attr);
> >>> >> >     res = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
> >>> >> >     res = pthread_create(&thrd, &attr, thread_function, NULL);
> >>> >> >     res = pthread_attr_destroy(&attr);
> >>> >> >     printf("Main thread. Sleep 5 seconds\n");
> >>> >> >     sleep(5);
> >>> >> >     printf("Exit from main process\n");
> >>> >> >     pthread_exit(0);
> >>> >> > }
> >>> >> >
> >>> >> > kkk@ ~/mycode/go () $ ./a.out &
> >>> >> > [1] 108418
> >>> >> > Main thread. Sleep 5 seconds
> >>> >> > The is new thread! Sleep 20 seconds...
> >>> >> > kkk@ ~/mycode/go () $
> >>> >> > Exit from main process
> >>> >> > PID TTY TIME CMD
> >>> >> > 49313 pts/26 00:00:01 bash
> >>> >> > 108418 pts/26 00:00:00 [a.out] <defunct>
> >>> >> > 108449 pts/26 00:00:00 ps
> >>> >> >
> >>> >> > See, the main process is <defunct> and the child is still hanging around
> >>> >> >
> >>> >> > kkk@ ~/mycode/go () $ sudo cat /proc/108418/task/108420/stack
> >>> >> > [<ffffffff810b4c1d>] hrtimer_nanosleep+0xbd/0x1d0
> >>> >> > [<ffffffff810b4dae>] SyS_nanosleep+0x7e/0x90
> >>> >> > [<ffffffff816a63c9>] system_call_fastpath+0x16/0x1b
> >>> >> > [<ffffffffffffffff>] 0xffffffffffffffff
> >>> >> > ujonnala@ ~/mycode/go () $ Exit from thread
> >>> >> >
> >>> >> > Any help in this regard is appreciated.
> >>> >>
> >>> >> I think you are misreading something somewhere. Zombie status is a
> >>> >> feature of a process, not a thread. It means that the child process
> >>> >> has exited but that the parent process, the one which started the
> >>> >> child process via the fork system call (or, on GNU/Linux, the clone
> >>> >> system call), has not called the wait (or waitpid or wait3 or wait4)
> >>> >> system call to collect its status.
> >>> >>
> >>> >> So don't look at threads or P's. Look at the parent process that
> >>> >> started the process that became a zombie.
> >>> >>
> >>> >> Ian
> >>> >
> >>> > --
> >>> > You received this message because you are subscribed to the Google Groups "golang-nuts" group.
> >>> > To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
> >>> > To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/f70e42f4-622d-4d91-b51d-ed00f2e11ac4n%40googlegroups.com.
> >
> > --
> > Kurtis Rader
> > Caretaker of the exceptional canines Junior and Hank