Re: [go-nuts] The program stucked on __vdso_clock_gettime (x86_64)

2022-07-05 Thread Neo Liu
>
>
> For me, this looks like the program is merely idling, and what you're
> observing is reading current time by the scheduler implemented by the Go
> runtime.
>
I didn't try it. The environment located in customer's data center, not
allowed to try this kind of things on it.



>


> [...]
> >1. Any body know the reason of the problem?
>
> I'm inclined to think that you're merely facing some deadlock or similar
> condition which precludes your normal Go code from doing useful progress
> while
> the runtime itself is OK.
>
> I agree with this, too. But the program didn't respond to any pprof
interaction (We implemented a pprof server using domain socket), so I can
not get goroutine information.



> >2. What can I do to get more information if this problem happened
> again?
>
> Do the following:
>
>  - Have a way to collect whatever the process is writing to its stderr.


>  - When the process wedges in the same way, send it SIGQUIT or SIGABRT so
> that
>it dumps the stacks of all the active goroutines to the stderr and
> exits.
>
> I  will try this next time.


> If that service handles HTTP requests, you might enable the handlers of the
> core debug/pprof package and when the process it apparently wedged
> navigate to
> its /debug/pprof/goroutine?debug=2 (unless you have grafted the root of
> those endpoints somewhere else) to have the stacks sent in the response.
>
> As I mentioned above, pprof is enabled by default, but the program ceased
to respond to anything.

Thanks very much.

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/CAF_Sdb9imBONz%3DYL4%3DUQ9aqTSwLQT%3DN4eL7_g5AKAEdPsjGrOQ%40mail.gmail.com.


Re: [go-nuts] The program stucked on __vdso_clock_gettime (x86_64)

2022-07-05 Thread Konstantin Khomoutov
On Mon, Jul 04, 2022 at 01:57:06AM -0700, Chenhong Liu wrote:

[...]
> After profiling the process with perf command, the perf data shows the 
> program looped to call __vdso_clock_gettime .
> 
> 64.20% controllerd [vdso] [.] __vdso_clock_gettime 
> 8.82% controllerd controllerd [.] runtime.procyield
> 8.82% controllerd controllerd [.] runtime.suspendG 
> 8.01% controllerd controllerd [.] runtime.nanotime1
> 1.47% controllerd [kernel.kallsyms] [k] __enqueue_entity 
> 0.79% controllerd [kernel.kallsyms] [k] system_call_after_swapgs 
> 0.79% controllerd [kernel.kallsyms] [k] set_next_entity 
> 0.68% controllerd [kernel.kallsyms] [k] _raw_gspin_lock
> 0.68% controllerd [kernel.kallsyms] [k] change_pte_range
> 0.65% controllerd [kernel.kallsyms] [k] auditsys
> 0.57% controllerd [kernel.kallsyms] [k] update_curr 
> 0.45% controllerd [kernel.kallsyms] [k] native_sched_clock 
> 0.45% controllerd [kernel.kallsyms] [k] cpuacct_charge 
> 0.45% controllerd [kernel.kallsyms] [k] __x86_indirect_thunk_+rax 
> 0.45% controllerd [kernel.kallsyms] [k] __audit_syscall_exit
> 0.34% controllerd [kernel.kallsyms] [k] pick_next_task_fail
> 0.23% controllerd controllerd [.] runtime.osyield 
> 0.23% controllerd [kernel.kallsyms] [k] native_queuedc_spin_lock_slowpath
> 0.23% controllerd [kernel.kallsyms] [k] dput 
> 0.23% controllerd [kernel.kallsyms] [k] __schedule 
> 0.23% controllerd [kernel.kallsyms] [k] put_prev_task_fair 
> 0.23% controllerd [kernel.kallsyms] [k] yield_task_fair
> 0.22% controllerd [kernel.kallsyms] [k] sys_sched_yield 
> 0.11% controllerd [kernel.kallsyms] [k] clear_buddies 
> 0.11% controllerd [kernel.kallsyms] [k] update_rq_clock.part.78 
> 0.11% controllerd [kernel.kallsyms] [k] update_min_vruntime 
> 0.11% controllerd [kernel.kallsyms] [k] tick_do_update_jiffies64 
> 0.11% controllerd [kernel.kallsyms] [k] system_call 
> 0.11% controllerd [kernel.kallsyms] [k] rb_next 
> 0.11% controllerd [kernel.kallsyms] [k] rb_insert_color 
> 0.00% controllerd controllerd [.] runtime.notesleep 
> 0.00% controllerd [kernel.kallsyms] [k] load_balance 
> 0.00% controllerd controllerd [.] runtime.runggrab 
> 0.00% controllerd controllerd [.] runtime.findrunnable 
> 0.00% controllerd [kernel.kallsyms] [k] update_numa_stats 
> 0.00% controllerd [kernel.kallsyms] [k] __lock_task_sighand
> 0.00% controllerd [kernel.kallsyms] [k] idle_cpu
> 0.00% controllerd [kernel.kallsyms] [k] kmem_cache_free_bulk 
> 0.00% controllerd [kernel.kallsyms] [k] task_numa_find_cpu
> 0.00% controllerd [kernel.kallsyms] [k] __queue_work
> 0.00% controllerd [kernel.kallsyms] [k] __perf_event_task_sched_in
> 0.00% controllerd [kernel.kallsyms] [k] finish_task_switch
> 0.00% controllerd [kernel.kallsyms] [k] perf_ctx_unlock
> 0.00% controllerd [kernel.kallsyms] [k] native_write_msr_safe
> 0.00% controllerd [kernel.kallsyms] [k] __perf_event_enable
[...]

For me, this looks like the program is merely idling, and what you're
observing is reading current time by the scheduler implemented by the Go
runtime.

[...]
>1. Any body know the reason of the problem?

I'm inclined to think that you're merely facing some deadlock or similar
condition which precludes your normal Go code from doing useful progress while
the runtime itself is OK.

>2. What can I do to get more information if this problem happened again?

Do the following:

 - Have a way to collect whatever the process is writing to its stderr.

 - When the process wedges in the same way, send it SIGQUIT or SIGABRT so that
   it dumps the stacks of all the active goroutines to the stderr and exits.

If that service handles HTTP requests, you might enable the handlers of the
core debug/pprof package and when the process it apparently wedged navigate to
its /debug/pprof/goroutine?debug=2 (unless you have grafted the root of
those endpoints somewhere else) to have the stacks sent in the response.

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/20220705095534.ay2wdgeinruwuf2w%40carbon.


[go-nuts] The program stucked on __vdso_clock_gettime (x86_64)

2022-07-04 Thread Chenhong Liu
Hi,

I encountered a wiered problem recently.

*Environment*

Arch: x86_64
CPU Vendor: hygon (AMD Zen1 OEM) (www.hygon.cn) (Hygon C86 7285)
OS: CentOS 7.6
Kernel: 3.10.0-957
Go version: 1.15.6

*What happened*

Our pure go program, named controllerd, stopped working except looping on 
one cpu core (from top command, it consumed one cpu core). Before it 
stopped working, it runs from mid April to Jun 12. 

After profiling the process with perf command, the perf data shows the 
program looped to call __vdso_clock_gettime .

64.20% controllerd [vdso] [.] __vdso_clock_gettime 
8.82% controllerd controllerd [.] runtime.procyield
8.82% controllerd controllerd [.] runtime.suspendG 
8.01% controllerd controllerd [.] runtime.nanotime1
1.47% controllerd [kernel.kallsyms] [k] __enqueue_entity 
0.79% controllerd [kernel.kallsyms] [k] system_call_after_swapgs 
0.79% controllerd [kernel.kallsyms] [k] set_next_entity 
0.68% controllerd [kernel.kallsyms] [k] _raw_gspin_lock
0.68% controllerd [kernel.kallsyms] [k] change_pte_range
0.65% controllerd [kernel.kallsyms] [k] auditsys
0.57% controllerd [kernel.kallsyms] [k] update_curr 
0.45% controllerd [kernel.kallsyms] [k] native_sched_clock 
0.45% controllerd [kernel.kallsyms] [k] cpuacct_charge 
0.45% controllerd [kernel.kallsyms] [k] __x86_indirect_thunk_+rax 
0.45% controllerd [kernel.kallsyms] [k] __audit_syscall_exit
0.34% controllerd [kernel.kallsyms] [k] pick_next_task_fail
0.23% controllerd controllerd [.] runtime.osyield 
0.23% controllerd [kernel.kallsyms] [k] native_queuedc_spin_lock_slowpath
0.23% controllerd [kernel.kallsyms] [k] dput 
0.23% controllerd [kernel.kallsyms] [k] __schedule 
0.23% controllerd [kernel.kallsyms] [k] put_prev_task_fair 
0.23% controllerd [kernel.kallsyms] [k] yield_task_fair
0.22% controllerd [kernel.kallsyms] [k] sys_sched_yield 
0.11% controllerd [kernel.kallsyms] [k] clear_buddies 
0.11% controllerd [kernel.kallsyms] [k] update_rq_clock.part.78 
0.11% controllerd [kernel.kallsyms] [k] update_min_vruntime 
0.11% controllerd [kernel.kallsyms] [k] tick_do_update_jiffies64 
0.11% controllerd [kernel.kallsyms] [k] system_call 
0.11% controllerd [kernel.kallsyms] [k] rb_next 
0.11% controllerd [kernel.kallsyms] [k] rb_insert_color 
0.00% controllerd controllerd [.] runtime.notesleep 
0.00% controllerd [kernel.kallsyms] [k] load_balance 
0.00% controllerd controllerd [.] runtime.runggrab 
0.00% controllerd controllerd [.] runtime.findrunnable 
0.00% controllerd [kernel.kallsyms] [k] update_numa_stats 
0.00% controllerd [kernel.kallsyms] [k] __lock_task_sighand
0.00% controllerd [kernel.kallsyms] [k] idle_cpu
0.00% controllerd [kernel.kallsyms] [k] kmem_cache_free_bulk 
0.00% controllerd [kernel.kallsyms] [k] task_numa_find_cpu
0.00% controllerd [kernel.kallsyms] [k] __queue_work
0.00% controllerd [kernel.kallsyms] [k] __perf_event_task_sched_in
0.00% controllerd [kernel.kallsyms] [k] finish_task_switch
0.00% controllerd [kernel.kallsyms] [k] perf_ctx_unlock
0.00% controllerd [kernel.kallsyms] [k] native_write_msr_safe
0.00% controllerd [kernel.kallsyms] [k] __perf_event_enable

B.T.W, perf command is:

perf record -o controllerd_perf.data -F 99 -p PID sleep 10

I'm not familiar with Go's runtime detail. But after getting the perf data 
and reading code of function suspendG of Go's runtime. I guessed: The 
program looping in suspendG and failed to find any other goroutine to 
execute.

*What was did*


   - Firstly, I suspected the clock source was not set correctly on the 
   system. But it's set to tsc, and it's right.
   - Then, I tried to send SIGUSR1 to the program to see if it could be 
   revived (the controllerd program can be triggered with SIGUSR1 to flush 
   logs), but nothing changed.
   - Finally, we have to restart the program to recover.
   

*Ask for help*


   1. Any body know the reason of the problem?
   2. What can I do to get more information if this problem happened again?
   
Thanks.

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/1552be0a-3c36-4561-8fd0-2e649195a269n%40googlegroups.com.