Hi Ian, Kurtis,

Thanks for the reply. We are fixing the issue. The point I wanted to raise 
here, though, is that a thread appears to be keeping the Go process in the 
defunct state.
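
To check which threads are still present under the defunct PID, something 
like the following could be used. It is only a minimal sketch, assuming Linux 
procfs: it walks /proc/<pid>/task and prints the "State:" line of each 
thread. The helper name parseState is my own, and the PID is taken from the 
first command-line argument (defaulting to the program itself).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// parseState returns the value of the "State:" line from the contents of a
// /proc/<pid>/task/<tid>/status file, or "" if no such line exists.
func parseState(status string) string {
	for _, line := range strings.Split(status, "\n") {
		if strings.HasPrefix(line, "State:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "State:"))
		}
	}
	return ""
}

func main() {
	// Inspect the PID given on the command line, or this process by default.
	pid := "self"
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}

	taskDir := filepath.Join("/proc", pid, "task")
	entries, err := os.ReadDir(taskDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot list threads:", err)
		return
	}

	for _, e := range entries {
		data, err := os.ReadFile(filepath.Join(taskDir, e.Name(), "status"))
		if err != nil {
			// The thread may have exited between ReadDir and ReadFile.
			continue
		}
		fmt.Printf("tid %s: %s\n", e.Name(), parseState(string(data)))
	}
}
```

Run against 87548, a leader in "Z (zombie)" next to another thread in 
"S (sleeping)" would suggest that the thread group has not fully exited yet.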
My kernel version is 
Linux version 4.14.175-1.nutanix.20200709.el7.x86_64 (dev@ca4b0551898c) 
(gcc version 7.3.1 20180303 (Red Hat 7.3.1-5) (GCC)) #1 SMP Fri Jul 10 
02:17:54 UTC 2020
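
On the panic itself ("fatal error: concurrent map writes"), the usual remedy 
is to serialize access to the map. The sketch below shows the general shape 
with a sync.Mutex. The names flagStore, Set and Len are hypothetical and are 
not taken from our replicator code.

```go
package main

import (
	"fmt"
	"sync"
)

// flagStore is a hypothetical stand-in for a map shared between goroutines.
// Every access goes through the mutex, so writes can never overlap.
type flagStore struct {
	mu    sync.Mutex
	flags map[string]bool
}

func newFlagStore() *flagStore {
	// Allocate the map up front so that Set never writes to a nil map.
	return &flagStore{flags: make(map[string]bool)}
}

// Set records a flag value under the lock.
func (s *flagStore) Set(key string, v bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.flags[key] = v
}

// Len reports the number of stored flags under the lock.
func (s *flagStore) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.flags)
}

func main() {
	s := newFlagStore()

	// Many goroutines write distinct keys concurrently.
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.Set(fmt.Sprintf("key-%d", i), true)
		}(i)
	}
	wg.Wait()

	fmt.Println("entries:", s.Len())
}
```

Running the program or its tests with the -race flag should help confirm 
that no unsynchronized access remains.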

Thanks & Regards,
Uday Kiran

On Thursday, September 10, 2020 at 6:42:06 PM UTC-7 Ian Lance Taylor wrote:

> On Thu, Sep 10, 2020 at 5:09 PM Kurtis Rader <kra...@skepticism.us> wrote:
> >
> > A defunct process is a process that has terminated but whose parent 
> process has not called wait() or one of its variants. I don't know why lsof 
> still reports open files. It shouldn't since a dead process should have its 
> resources, such as its file descriptor table, freed by the kernel even if 
> the parent hasn't called wait(). You didn't tell us the details of the OS 
> you're using so I would simply assume it's a quirk of your OS. It might be 
> more productive to look into why your program is panicking at 
> map_faststr.go:275. A likely explanation is you have a race in your program 
> that is causing it to attempt to mutate a map concurrently or you're trying 
> to insert into a nil map.
>
> That's a good point. What OS are you using? I don't think you said.
>
> Ian
>
>
> > On Thu, Sep 10, 2020 at 4:43 PM Uday Kiran Jonnala <juday...@gmail.com> 
> wrote:
> >>
> >> Hi Ian,
> >>
> >> Thanks again for the reply. The problem is that the Go process is in the
> >> defunct state, and I am sure the parent process did not get SIGCHLD.
> >> Looking deeper, I see a thread in futex_wait_queue_me. If this is just a
> >> leftover stack trace and the Go process was actually killed, why do I
> >> still see its fds, with the file table and fd table intact (see the lsof
> >> information)?
> >>
> >> The process that panicked and is now defunct is 87548. Checking the
> >> threads in this process:
> >>
> >> bash-4.2# cat /proc/87548/status
> >> Name: replicator
> >> State: Z (zombie)
> >>
> >> bash-4.2# ls -Fl /proc/87548/task/87561/fd | grep 606649
> >> l-wx------. 1 root root 64 Aug 25 10:59 1 -> pipe:[606649]
> >> l-wx------. 1 root root 64 Aug 25 10:59 2 -> pipe:[606649]
> >>
> >> Listing the threads
> >>
> >> bash-4.2# ps -aefT | grep 87548
> >> root 87548 87548 87507 0 Aug23 ? 00:00:00 [replicator] <defunct>
> >> root 87548 87561 87507 0 Aug23 ? 00:00:00 [replicator] <defunct>
> >> root 112448 112448 42566 0 17:13 pts/0 00:00:00 grep 87548
> >>
> >> bash-4.2# lsof | grep 606649
> >> replicato 87548 87561 root 1w FIFO 0,11 0t0 606649 pipe
> >> replicato 87548 87561 root 2w FIFO 0,11 0t0 606649 pipe
> >>
> >> Why does lsof show the entry for the FIFO file of this process?
> >>
> >> So I suspect the thread sleeping in futex_wait_queue_me was not cleaned up
> >> during panic(): the main thread exited, leaving behind a detached thread
> >> that is still waiting in futex_wait_queue_me.
> >>
> >> The main difficulty is that I am not able to reproduce this, since this Go
> >> process is very big.
> >>
> >> Is there any way to verify this or take it further?
> >>
> >> Thanks & Regards,
> >> Uday Kiran
> >> On Monday, September 7, 2020 at 12:05:05 PM UTC-7 Ian Lance Taylor 
> wrote:
> >>>
> >>> On Mon, Sep 7, 2020 at 12:03 AM Uday Kiran Jonnala <juday...@gmail.com> 
> wrote:
> >>> >
> >>> > Thanks for the reply. I get the point about zombies, but I do not think
> >>> > the issue here is the parent not reaping the child. It seems the Go
> >>> > process has not finished executing some internal threads (waiting on a
> >>> > futex), and that is preventing SIGCHLD from being sent to the parent.
> >>> >
> >>> > The Go process named replicator hit a panic, and I see it went into the
> >>> > zombie state:
> >>> >
> >>> > $ ps -ef | grep replicator
> >>> > root 87548 87507 0 Aug23 ? 00:00:00 [replicator] <defunct>
> >>> >
> >>> > Now looking at the tasks within the process
> >>> >
> >>> > I see that the stack trace of a thread within the process is still stuck
> >>> > in the following:
> >>> >
> >>> > bash-4.2# cat /proc/87548/task/87561/stack
> >>> > [<ffffffffbb114714>] futex_wait_queue_me+0xc4/0x120
> >>> > [<ffffffffbb11520a>] futex_wait+0x10a/0x250
> >>> > [<ffffffffbb1182ce>] do_futex+0x35e/0x5b0
> >>> > [<ffffffffbb11865b>] SyS_futex+0x13b/0x180
> >>> > [<ffffffffbb003c09>] do_syscall_64+0x79/0x1b0
> >>> > [<ffffffffbba00081>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> >>> > [<ffffffffffffffff>] 0xffffffffffffffff
> >>> >
> >>> > Going by the above example, if we create some internal threads and the
> >>> > main thread exits due to a panic, leaving some detached threads behind,
> >>> > the process will stay in the zombie state until the threads within the
> >>> > process complete.
> >>> >
> >>> > It appears that some runaway threads in a hung state are causing this. I
> >>> > am not able to reproduce it with an explicit panic in the main goroutine
> >>> > while another goroutine is still executing.
> >>> >
> >>> > Does the above stack trace look familiar with respect to the internal
> >>> > threads of the Go runtime?
> >>>
> >>> If the process is defunct, then none of the thread stacks matter.
> >>> They are just where the thread happened to be when the process exited.
> >>>
> >>> What is the real problem you are seeing?
> >>>
> >>> Ian
> >>>
> >>>
> >>>
> >>>
> >>> > On Thursday, August 27, 2020 at 1:43:39 PM UTC-7 Ian Lance Taylor 
> wrote:
> >>> >>
> >>> >> On Thu, Aug 27, 2020 at 10:01 AM Uday Kiran Jonnala
> >>> >> <juday...@gmail.com> wrote:
> >>> >> >
> >>> >> > I have a situation involving a zombie process with Go.
> >>> >> >
> >>> >> > A process (in this case, replicator) has many goroutines internally.
> >>> >> >
> >>> >> > We hit a panic() and I see the replicator process is in the zombie
> >>> >> > state:
> >>> >> >
> >>> >> > <<>>>:~$ ps -ef | grep replicator
> >>> >> >
> >>> >> > root 87548 87507 0 Aug23 ? 00:00:00 [replicator] <defunct>
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> > The main goroutine (or the supporting P) exited, but the panic left the
> >>> >> > other P thread still executing, in a blocked state (the main P could be
> >>> >> > 87548, and the supporting P thread 87561 is still there):
> >>> >> >
> >>> >> > bash-4.2# ls -Fl /proc/87548/task/87561/fd | grep 606649
> >>> >> > l-wx------. 1 root root 64 Aug 25 10:59 1 -> pipe:[606649]
> >>> >> > l-wx------. 1 root root 64 Aug 25 10:59 2 -> pipe:[606649]
> >>> >> >
> >>> >> > Stack trace
> >>> >> >
> >>> >> > bash-4.2# cat /proc/87548/task/87561/stack
> >>> >> > [<ffffffffbb114714>] futex_wait_queue_me+0xc4/0x120
> >>> >> > [<ffffffffbb11520a>] futex_wait+0x10a/0x250
> >>> >> > [<ffffffffbb1182ce>] do_futex+0x35e/0x5b0
> >>> >> > [<ffffffffbb11865b>] SyS_futex+0x13b/0x180
> >>> >> > [<ffffffffbb003c09>] do_syscall_64+0x79/0x1b0
> >>> >> > [<ffffffffbba00081>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> >>> >> > [<ffffffffffffffff>] 0xffffffffffffffff
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> > We have panic internally from main go routine
> >>> >> >
> >>> >> > fatal error: concurrent map writes
> >>> >> >
> >>> >> > goroutine 666359 [running]:
> >>> >> > runtime.throw(0x101d6ae, 0x15)
> >>> >> > /home/ll/ntnx/toolchain-builds/78ae837ba07c8ef8f0ea782407d8d4626815552b.x86_64/go/src/runtime/panic.go:608 +0x72 fp=0xc00374b6f0 sp=0xc00374b6c0 pc=0x42da62
> >>> >> > runtime.mapassign_faststr(0xdb71c0, 0xc00023f5f0, 0xc000aca990, 0x83, 0xc0009d03c8)
> >>> >> > /home/ll/ntnx/toolchain-builds/78ae837ba07c8ef8f0ea782407d8d4626815552b.x86_64/go/src/runtime/map_faststr.go:275 +0x3bf fp=0xc00374b758 sp=0xc00374b6f0 pc=0x41527f
> >>> >> > github.eng.nutanix.com/xyz/abc/metadata.UpdateRecvInProgressFlag(0xc000aca990, 0x83, 0x0)
> >>> >> >
> >>> >> > .......
> >>> >> >
> >>> >> > goroutine 665516 [chan receive, 2 minutes]:
> >>> >> > zeus.(*Leadership).LeaderValue.func1(0xc003d5c120, 0x0, 0xc002e906c0, 0x52, 0xc00302ec60, 0x29)
> >>> >> > /home/ll/ntnx/main/build/.go/src/zeus/leadership.go:244 +0x34
> >>> >> > created by zeus.(*Leadership).LeaderValue
> >>> >> > /home/ll/ntnx/main/build/.go/src/zeus/leadership.go:243 +0x277
> >>> >> > 2020-08-03 00:35:04 rolled over log file
> >>> >> > ERROR: logging before flag.Parse: I0803 00:35:04.426906 196123 dataset.go:26] initialize zfs linking
> >>> >> > ERROR: logging before flag.Parse: I0803 00:35:04.433296 196123 dataset.go:34] completed zfs linking successfully
> >>> >> > I0803 00:35:04.433447 196123 main.go:86] Gflags passed NodeUuid: c238e584-0eeb-48bd-b299-2a25b13602f1, External Ip: 10.15.96.163
> >>> >> > I0803 00:35:04.433460 196123 main.go:99] Component name using for this process : abc-c238e584-0eeb-48bd-b299-2a25b13602f1
> >>> >> > I0803 00:35:04.433467 196123 main.go:120] Trying to initialize DB
> >>> >> >
> >>> >> > If there is a panic() from the main P thread, as I understand it we
> >>> >> > exit() and clean up all P threads of the process.
> >>> >> >
> >>> >> > Are we hitting the following scenario? I did not look into the M-P-G
> >>> >> > implementation in detail.
> >>> >> >
> >>> >> > Example:
> >>> >> >
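> >>> >> > #include <stdio.h>
> >>> >> > #include <pthread.h>
> >>> >> > #include <unistd.h>
> >>> >> > #include <stdlib.h>
> >>> >> >
> >>> >> > /* Detached worker thread: it outlives the main thread. */
> >>> >> > void *thread_function(void *args)
> >>> >> > {
> >>> >> >     (void)args; /* unused */
> >>> >> >     printf("The is new thread! Sleep 20 seconds...\n");
> >>> >> >     sleep(20); /* matches the message printed above */
> >>> >> >     printf("Exit from thread\n");
> >>> >> >     pthread_exit(0);
> >>> >> > }
> >>> >> >
> >>> >> > int main(int argc, char **argv)
> >>> >> > {
> >>> >> >     pthread_t thrd;
> >>> >> >     pthread_attr_t attr;
> >>> >> >     int res = 0;
> >>> >> >
> >>> >> >     (void)argc; /* unused */
> >>> >> >     (void)argv; /* unused */
> >>> >> >
> >>> >> >     /* Create the worker in the detached state so nobody joins it. */
> >>> >> >     res = pthread_attr_init(&attr);
> >>> >> >     res = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
> >>> >> >     res = pthread_create(&thrd, &attr, thread_function, NULL);
> >>> >> >     if (res != 0) {
> >>> >> >         fprintf(stderr, "pthread_create failed: %d\n", res);
> >>> >> >         return 1;
> >>> >> >     }
> >>> >> >     res = pthread_attr_destroy(&attr);
> >>> >> >
> >>> >> >     printf("Main thread. Sleep 5 seconds\n");
> >>> >> >     sleep(5);
> >>> >> >     printf("Exit from main process\n");
> >>> >> >
> >>> >> >     /* pthread_exit() ends only the main thread. The process stays
> >>> >> >        around, shown as <defunct>, until the worker thread exits. */
> >>> >> >     pthread_exit(0);
> >>> >> > }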
> >>> >> > kkk@ ~/mycode/go () $ ./a.out &
> >>> >> > [1] 108418
> >>> >> > Main thread. Sleep 5 seconds
> >>> >> > The is new thread! Sleep 20 seconds...
> >>> >> > kkk@ ~/mycode/go () $
> >>> >> > Exit from main process
> >>> >> > PID TTY TIME CMD
> >>> >> > 49313 pts/26 00:00:01 bash
> >>> >> > 108418 pts/26 00:00:00 [a.out] <defunct>
> >>> >> > 108449 pts/26 00:00:00 ps
> >>> >> >
> >>> >> > See, the main process is <defunct> and the child thread is still
> >>> >> > hanging around:
> >>> >> >
> >>> >> > kkk@ ~/mycode/go () $ sudo cat /proc/108418/task/108420/stack
> >>> >> > [<ffffffff810b4c1d>] hrtimer_nanosleep+0xbd/0x1d0
> >>> >> > [<ffffffff810b4dae>] SyS_nanosleep+0x7e/0x90
> >>> >> > [<ffffffff816a63c9>] system_call_fastpath+0x16/0x1b
> >>> >> > [<ffffffffffffffff>] 0xffffffffffffffff
> >>> >> > ujonnala@ ~/mycode/go () $ Exit from thread
> >>> >> >
> >>> >> > Any help in this regard is appreciated.
> >>> >>
> >>> >>
> >>> >> I think you are misreading something somewhere. Zombie status is a
> >>> >> feature of a process, not a thread. It means that the child process
> >>> >> has exited but that the parent process, the one which started the
> >>> >> child process via the fork system call (or, on GNU/Linux, the clone
> >>> >> system call), has not called the wait (or waitpid or wait3 or wait4)
> >>> >> system call to collect its status.
> >>> >>
> >>> >> So don't look at threads or P's. Look at the parent process that
> >>> >> started the process that became a zombie.
> >>> >>
> >>> >> Ian
> >>> >
> >>> > --
> >>> > You received this message because you are subscribed to the Google 
> Groups "golang-nuts" group.
> >>> > To unsubscribe from this group and stop receiving emails from it, 
> send an email to golang-nuts...@googlegroups.com.
> >>> > To view this discussion on the web visit 
> https://groups.google.com/d/msgid/golang-nuts/f70e42f4-622d-4d91-b51d-ed00f2e11ac4n%40googlegroups.com
> .
> >>
> >
> >
> >
> > --
> > Kurtis Rader
> > Caretaker of the exceptional canines Junior and Hank
> >
>
