Ok I found the problem. A tight loop was indeed preventing the garbage collector from proceeding, and then the GC was stopping everything else. The loop contained atomic.LoadUint64 and atomic.CompareAndSwapUint64, but neither of those seem to qualify as preemption points. Fixing that loop fixes everything.
Thanks to Ian for pointing me on the right path and thanks to Florin Pățan on gophers.slack.com for pointing me at Delve which eventually let me find the problematic code. I know this is not at all the job of the race detector, but given that compiling with -race is the first thing we do when a program experiences "deadlock" type behavior, and I have seen several people being thrown off by the way go behaves when there are tight loops, can we not detect that the GC has stalled for way too long and print something out? Like "Goroutine #545 is preventing GC" or something? Just a thought. Thanks Michael On Thu, Feb 22, 2018 at 11:09 PM, Michael Andersen <mich...@steelcode.com> wrote: > I have found that GODEBUG=schedtrace still prints when the program is > frozen but I cannot find much documentation on deciphering its output. The > full output is here: https://gist.github.com/immesys/ > 7f213c63c54ce60ba72d1cbc9ffcc4cb but here are the first few lines. It > looks like this is garbage collector related? > > SCHED 111364ms: gomaxprocs=40 idleprocs=0 threads=78 spinningthreads=0 > idlethreads=65 runqueue=31 gcwaiting=1 nmidlelocked=1 stopwait=1 > sysmonwait=0 > P0: status=3 schedtick=36699 syscalltick=29158 m=-1 runqsize=0 gfreecnt=45 > P1: status=3 schedtick=37370 syscalltick=30563 m=-1 runqsize=0 gfreecnt=31 > P2: status=3 schedtick=36650 syscalltick=29478 m=-1 runqsize=0 gfreecnt=17 > P3: status=3 schedtick=42395 syscalltick=34028 m=-1 runqsize=0 gfreecnt=38 > P4: status=3 schedtick=40531 syscalltick=33546 m=-1 runqsize=0 gfreecnt=27 > P5: status=3 schedtick=41369 syscalltick=31884 m=-1 runqsize=0 gfreecnt=19 > P6: status=3 schedtick=39302 syscalltick=32162 m=-1 runqsize=0 gfreecnt=39 > P7: status=3 schedtick=39739 syscalltick=32624 m=-1 runqsize=0 gfreecnt=34 > P8: status=3 schedtick=32585 syscalltick=24358 m=-1 runqsize=0 gfreecnt=45 > P9: status=3 schedtick=40376 syscalltick=30399 m=-1 runqsize=0 gfreecnt=27 > P10: status=3 schedtick=31018 syscalltick=22164 m=-1 runqsize=0 gfreecnt=40 > P11: status=3 schedtick=32043 syscalltick=24479 m=-1 runqsize=0 gfreecnt=53 > P12: status=3 schedtick=36459 syscalltick=25222 m=-1 runqsize=0 gfreecnt=21 > P13: status=3 schedtick=38445 syscalltick=28317 m=-1 runqsize=0 gfreecnt=38 > P14: status=3 schedtick=39533 syscalltick=28189 m=-1 runqsize=0 gfreecnt=29 > P15: status=3 schedtick=37009 syscalltick=27231 m=-1 runqsize=0 gfreecnt=29 > P16: status=3 schedtick=36957 syscalltick=26211 m=44 runqsize=0 gfreecnt=45 > P17: status=3 schedtick=35632 syscalltick=26481 m=-1 runqsize=0 gfreecnt=48 > P18: status=3 schedtick=36228 syscalltick=24751 m=-1 runqsize=0 gfreecnt=59 > P19: status=3 schedtick=32479 syscalltick=22970 m=-1 runqsize=0 gfreecnt=48 > P20: status=3 schedtick=34090 syscalltick=25864 m=-1 runqsize=0 gfreecnt=12 > P21: status=3 schedtick=35409 syscalltick=24369 m=-1 runqsize=0 gfreecnt=6 > P22: status=3 schedtick=42235 syscalltick=28607 m=-1 runqsize=0 gfreecnt=16 > P23: status=3 schedtick=29733 syscalltick=19429 m=-1 runqsize=0 gfreecnt=43 > P24: status=3 schedtick=34757 syscalltick=22141 m=-1 runqsize=0 gfreecnt=26 > P25: status=3 schedtick=31434 syscalltick=21767 m=-1 runqsize=0 gfreecnt=23 > P26: status=3 schedtick=31597 syscalltick=21657 m=-1 runqsize=0 gfreecnt=53 > P27: status=3 schedtick=32539 syscalltick=22653 m=-1 runqsize=0 gfreecnt=34 > P28: status=3 schedtick=29503 syscalltick=20588 m=-1 runqsize=0 gfreecnt=28 > P29: status=3 schedtick=30842 syscalltick=20199 m=-1 runqsize=0 gfreecnt=41 > P30: status=3 schedtick=29787 syscalltick=19492 m=-1 runqsize=0 gfreecnt=15 > P31: status=3 schedtick=28226 syscalltick=19101 m=-1 runqsize=0 gfreecnt=32 > P32: status=3 schedtick=34254 syscalltick=24135 m=-1 runqsize=0 gfreecnt=27 > P33: status=3 schedtick=34567 syscalltick=22554 m=-1 runqsize=0 gfreecnt=36 > P34: status=3 schedtick=30416 syscalltick=19722 m=-1 runqsize=0 gfreecnt=33 > P35: status=1 schedtick=19536 syscalltick=13084 m=0 runqsize=0 gfreecnt=50 > P36: status=3 schedtick=29336 syscalltick=19015 m=-1 runqsize=0 gfreecnt=28 > P37: status=3 schedtick=29794 syscalltick=19499 m=-1 runqsize=0 gfreecnt=40 > P38: status=3 schedtick=31748 syscalltick=20359 m=-1 runqsize=0 gfreecnt=35 > P39: status=3 schedtick=31851 syscalltick=20260 m=-1 runqsize=0 gfreecnt=46 > > > > On Thu, Feb 22, 2018 at 8:47 PM, Michael Andersen <mich...@steelcode.com> > wrote: > >> Hi >> >> Thanks for your suggestions. This freeze happens to coincide with a >> larger number of cgo calls which in turn most likely do blocking read/write >> from sockets. How are those treated by the scheduler? >> >> I am not doing anything FUSE related, nor do I have assembly code. The >> only "interesting" stuff is the use of a c library - librados. >> >> In the goroutine dump from SIGQUIT, some stacks are followed by register >> dumps. Are those the only ones currently running? All of those are in >> runtime: 16 are in runtime.futex, 2 are in runtime.notetsleepg and one in >> runtime.gopark, which would make it seem like it was not an un-preemptable >> tight loop, but I am not sure how to parse the SIGQUIT output. >> >> Thanks >> Michael >> >> >>> I don't know what is happening with your program. >>> >>> This kind of thing can happen if you have a goroutine that is running >>> in a tight loop with no function calls or memory allocations. The >>> current Go scheduler is non-preemptive, meaning that nothing will stop >>> that loop. If that loop occurs while holding a lock, it could block >>> the entire rest of the program from running. However, you would see >>> this in the stack trace. This problem with the current scheduler is >>> https://golang.org/issue/10958. >>> >>> This kind of thing can happen if you are using an in-process FUSE file >>> system implemented by goroutines in your program. The Go runtime >>> believes that some system calls, such as pipe or socket, never block. >>> If you have somehow set things up so that those system calls enter >>> your FUSE file system and depend on some other goroutine running, it >>> is possible that that goroutine will never be scheduled. I don't know >>> of any bugs like this at present but they have existed in the past. >>> Of course if you aren't using a FUSE file system then this is not the >>> problem. >>> >>> This kind of thing can happen if you use assembler code to do a >>> blocking operation, or if you use syscall.Rawsyscall to call a system >>> call that blocks. That can confuse the scheduler and lead to a >>> deadlock. Don't do that. >>> >>> None of these are likely, but you asked for suggestions, and that's >>> what I've come up with. >>> >>> Ian >>> >> >> > -- You received this message because you are subscribed to the Google Groups "golang-nuts" group. To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.