I have found that GODEBUG=schedtrace still prints when the program is frozen but I cannot find much documentation on deciphering its output. The full output is here: https://gist.github.com/immesys/7f213c63c54ce60ba72d1cbc9ffcc4cb but here are the first few lines. It looks like this is garbage collector related?
SCHED 111364ms: gomaxprocs=40 idleprocs=0 threads=78 spinningthreads=0 idlethreads=65 runqueue=31 gcwaiting=1 nmidlelocked=1 stopwait=1 sysmonwait=0 P0: status=3 schedtick=36699 syscalltick=29158 m=-1 runqsize=0 gfreecnt=45 P1: status=3 schedtick=37370 syscalltick=30563 m=-1 runqsize=0 gfreecnt=31 P2: status=3 schedtick=36650 syscalltick=29478 m=-1 runqsize=0 gfreecnt=17 P3: status=3 schedtick=42395 syscalltick=34028 m=-1 runqsize=0 gfreecnt=38 P4: status=3 schedtick=40531 syscalltick=33546 m=-1 runqsize=0 gfreecnt=27 P5: status=3 schedtick=41369 syscalltick=31884 m=-1 runqsize=0 gfreecnt=19 P6: status=3 schedtick=39302 syscalltick=32162 m=-1 runqsize=0 gfreecnt=39 P7: status=3 schedtick=39739 syscalltick=32624 m=-1 runqsize=0 gfreecnt=34 P8: status=3 schedtick=32585 syscalltick=24358 m=-1 runqsize=0 gfreecnt=45 P9: status=3 schedtick=40376 syscalltick=30399 m=-1 runqsize=0 gfreecnt=27 P10: status=3 schedtick=31018 syscalltick=22164 m=-1 runqsize=0 gfreecnt=40 P11: status=3 schedtick=32043 syscalltick=24479 m=-1 runqsize=0 gfreecnt=53 P12: status=3 schedtick=36459 syscalltick=25222 m=-1 runqsize=0 gfreecnt=21 P13: status=3 schedtick=38445 syscalltick=28317 m=-1 runqsize=0 gfreecnt=38 P14: status=3 schedtick=39533 syscalltick=28189 m=-1 runqsize=0 gfreecnt=29 P15: status=3 schedtick=37009 syscalltick=27231 m=-1 runqsize=0 gfreecnt=29 P16: status=3 schedtick=36957 syscalltick=26211 m=44 runqsize=0 gfreecnt=45 P17: status=3 schedtick=35632 syscalltick=26481 m=-1 runqsize=0 gfreecnt=48 P18: status=3 schedtick=36228 syscalltick=24751 m=-1 runqsize=0 gfreecnt=59 P19: status=3 schedtick=32479 syscalltick=22970 m=-1 runqsize=0 gfreecnt=48 P20: status=3 schedtick=34090 syscalltick=25864 m=-1 runqsize=0 gfreecnt=12 P21: status=3 schedtick=35409 syscalltick=24369 m=-1 runqsize=0 gfreecnt=6 P22: status=3 schedtick=42235 syscalltick=28607 m=-1 runqsize=0 gfreecnt=16 P23: status=3 schedtick=29733 syscalltick=19429 m=-1 runqsize=0 gfreecnt=43 P24: status=3 schedtick=34757 syscalltick=22141 m=-1 runqsize=0 gfreecnt=26 P25: status=3 schedtick=31434 syscalltick=21767 m=-1 runqsize=0 gfreecnt=23 P26: status=3 schedtick=31597 syscalltick=21657 m=-1 runqsize=0 gfreecnt=53 P27: status=3 schedtick=32539 syscalltick=22653 m=-1 runqsize=0 gfreecnt=34 P28: status=3 schedtick=29503 syscalltick=20588 m=-1 runqsize=0 gfreecnt=28 P29: status=3 schedtick=30842 syscalltick=20199 m=-1 runqsize=0 gfreecnt=41 P30: status=3 schedtick=29787 syscalltick=19492 m=-1 runqsize=0 gfreecnt=15 P31: status=3 schedtick=28226 syscalltick=19101 m=-1 runqsize=0 gfreecnt=32 P32: status=3 schedtick=34254 syscalltick=24135 m=-1 runqsize=0 gfreecnt=27 P33: status=3 schedtick=34567 syscalltick=22554 m=-1 runqsize=0 gfreecnt=36 P34: status=3 schedtick=30416 syscalltick=19722 m=-1 runqsize=0 gfreecnt=33 P35: status=1 schedtick=19536 syscalltick=13084 m=0 runqsize=0 gfreecnt=50 P36: status=3 schedtick=29336 syscalltick=19015 m=-1 runqsize=0 gfreecnt=28 P37: status=3 schedtick=29794 syscalltick=19499 m=-1 runqsize=0 gfreecnt=40 P38: status=3 schedtick=31748 syscalltick=20359 m=-1 runqsize=0 gfreecnt=35 P39: status=3 schedtick=31851 syscalltick=20260 m=-1 runqsize=0 gfreecnt=46 On Thu, Feb 22, 2018 at 8:47 PM, Michael Andersen <mich...@steelcode.com> wrote: > Hi > > Thanks for your suggestions. This freeze happens to coincide with a larger > number of cgo calls which in turn most likely do blocking read/write from > sockets. How are those treated by the scheduler? > > I am not doing anything FUSE related, nor do I have assembly code. The > only "interesting" stuff is the use of a c library - librados. > > In the goroutine dump from SIGQUIT, some stacks are followed by register > dumps. Are those the only ones currently running? All of those are in > runtime: 16 are in runtime.futex, 2 are in runtime.notetsleepg and one in > runtime.gopark, which would make it seem like it was not an un-preemptable > tight loop, but I am not sure how to parse the SIGQUIT output. > > Thanks > Michael > > >> I don't know what is happening with your program. >> >> This kind of thing can happen if you have a goroutine that is running >> in a tight loop with no function calls or memory allocations. The >> current Go scheduler is non-preemptive, meaning that nothing will stop >> that loop. If that loop occurs while holding a lock, it could block >> the entire rest of the program from running. However, you would see >> this in the stack trace. This problem with the current scheduler is >> https://golang.org/issue/10958. >> >> This kind of thing can happen if you are using an in-process FUSE file >> system implemented by goroutines in your program. The Go runtime >> believes that some system calls, such as pipe or socket, never block. >> If you have somehow set things up so that those system calls enter >> your FUSE file system and depend on some other goroutine running, it >> is possible that that goroutine will never be scheduled. I don't know >> of any bugs like this at present but they have existed in the past. >> Of course if you aren't using a FUSE file system then this is not the >> problem. >> >> This kind of thing can happen if you use assembler code to do a >> blocking operation, or if you use syscall.Rawsyscall to call a system >> call that blocks. That can confuse the scheduler and lead to a >> deadlock. Don't do that. >> >> None of these are likely, but you asked for suggestions, and that's >> what I've come up with. >> >> Ian >> > > -- You received this message because you are subscribed to the Google Groups "golang-nuts" group. To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.