Ok, so I have more information, and it's not what I would expect.

I added scheddetail=1,schedtrace=2000 so that I had a list of which M's and
G's were on the P's during the 5 seconds that scheduling stalled. I added a
sentinel goroutine that sleeps 1 second in a loop and panics if the sleep
takes more than 4 seconds. As soon as the sentinel wakes up after the
stall, it dumps all the stacks for all the goroutines under  the assumption
that at least some of them might be in the same place as they were during
the stall. Thus, I know for sure which goroutines caused the problem and
maybe which piece of the code they were in.

During the stall, all of the P's had the same M's and the same G's, so
nothing was changing except the idlethreads and runqueues (as I would
expect because my sentinel goroutine didn't get run). Here is a compilation
of the information captured during the stall.

SCHED 3425490ms: gomaxprocs=8 idleprocs=0 threads=52 spinningthreads=0
idlethreads=35 runqueue=8 gcwaiting=0 nmidlelocked=1 stopwait=0 sysmonwait=0
 < same M's on the same P's with the same curg's as below >
SCHED 3427497ms: gomaxprocs=8 idleprocs=0 threads=52 spinningthreads=0
idlethreads=36 runqueue=9 gcwaiting=0 nmidlelocked=1 stopwait=0 sysmonwait=0
  P0: status=1 schedtick=22978494 syscalltick=6811602 m=44 runqsize=0
gfreecnt=2
  M44: p=0 curg=12934173 mallocing=0 throwing=0 preemptoff= locks=0 dying=0
spinning=false blocked=false lockedg=-1
  G12934173: status=2(chan receive) m=44 lockedm=-1

  P1: status=1 schedtick=22848756 syscalltick=6851102 m=34 runqsize=0
gfreecnt=32
  M34: p=1 curg=12934084 mallocing=0 throwing=0 preemptoff= locks=0 dying=0
spinning=false blocked=false lockedg=-1
  G12934084: status=2(chan receive) m=34 lockedm=-1

  P2: status=1 schedtick=22743131 syscalltick=6730185 m=17 runqsize=0
gfreecnt=37
  M17: p=2 curg=12934032 mallocing=0 throwing=0 preemptoff= locks=0 dying=0
spinning=false blocked=false lockedg=-1
  G12934032: status=2(chan receive) m=17 lockedm=-1

  P3: status=1 schedtick=22674653 syscalltick=6516472 m=11 runqsize=0
gfreecnt=20
  M11: p=3 curg=12934147 mallocing=0 throwing=0 preemptoff= locks=0 dying=0
spinning=false blocked=false lockedg=-1
  G12934147: status=2(chan receive) m=11 lockedm=-1

  P4: status=1 schedtick=22693327 syscalltick=6322345 m=33 runqsize=0
gfreecnt=12
  M33: p=4 curg=12934059 mallocing=0 throwing=0 preemptoff= locks=0 dying=0
spinning=false blocked=false lockedg=-1
  G12934059: status=2(chan receive) m=33 lockedm=-1

  P5: status=1 schedtick=22370152 syscalltick=6219903 m=0 runqsize=0
gfreecnt=39
  M0: p=5 curg=12934131 mallocing=0 throwing=0 preemptoff= locks=0 dying=0
spinning=false blocked=false lockedg=-1
  G12934131: status=2(chan receive) m=0 lockedm=-1

  P6: status=1 schedtick=21910259 syscalltick=6171748 m=51 runqsize=0
gfreecnt=39
  M51: p=6 curg=12934126 mallocing=0 throwing=0 preemptoff= locks=0 dying=0
spinning=false blocked=false lockedg=-1
  G12934126: status=2(chan receive) m=51 lockedm=-1

  P7: status=1 schedtick=22019793 syscalltick=6091894 m=32 runqsize=1
gfreecnt=30
  M32: p=7 curg=12934191 mallocing=0 throwing=0 preemptoff= locks=0 dying=0
spinning=false blocked=false lockedg=-1
  G12934191: status=2(chan receive) m=32 lockedm=-1

After the sentinel goroutine was scheduled (~1 second after this schedtrace
output) I captured the stacks and crossreferenced the G's. All of them are
in time.Now()

  goroutine 12934173 [runnable]:
  time.Now(0x7f453a441dc0, 0xc0169fe000, 0x5a0)
  /usr/local/go/src/time/time.go:1087 +0xb2
  -- line in request handler in my app--

  goroutine 12934084 [runnable]:
  time.Now(0x7f4614a65400, 0xc0155a2000, 0x1680)
  /usr/local/go/src/time/time.go:1087 +0xb2
  -- line in request handler in my app--

  goroutine 12934032 [runnable]:
  time.Now(0x7f457d74ae20, 0xc023179400, 0x3c0)
  /usr/local/go/src/time/time.go:1087 +0xb2
  -- line in request handler in my app--

  goroutine 12934147 [runnable]:
  time.Now(0x7f461fcca4c0, 0xc00cf38400, 0x3c0)
  /usr/local/go/src/time/time.go:1087 +0xb2
  -- line in request handler in my app--

  goroutine 12934059 [runnable]:
  time.Now(0x7f45ff9dcd00, 0xc0245b1000, 0xf00)
  /usr/local/go/src/time/time.go:1087 +0xb2
  -- line in request handler in my app--

  goroutine 12934131 [runnable]:
  time.Now(0x7f4513453e80, 0xc02a96c000, 0x5a0)
  /usr/local/go/src/time/time.go:1087 +0xb2
  -- line in request handler in my app--

  goroutine 12934126 [runnable]:
  time.Now(0x7f457fd46e60, 0xc016889200, 0x5a0)
  /usr/local/go/src/time/time.go:1087 +0xb2
  -- line in request handler in my app--

  goroutine 12934191 [runnable]:
  time.Now(0x7f45088d6b80, 0xc01770ec00, 0x5a0)
  /usr/local/go/src/time/time.go:1087 +0xb2
  -- line in request handler in my app--

I went through the path that a goroutine takes to get to that time.Now()
line and it flows from GRPC through to some simple sanitization. There are
function calls sprinkled everywhere and no loops that would be
unpreemtible, so I don't think they all got stuck in some tight loop before
the time.Now and then somehow all proceeded a few lines so that they ALL
were on time.Now when I dumped the stacks. I think they were stuck in
time.Now() during the scheduling stall.

This is running in kubernetes on AWS, although I have seen it on a bare
metal kubernetes cluster too.

Any ideas where to progress from here?



On Fri, Aug 23, 2019 at 3:32 PM Ian Lance Taylor <i...@golang.org> wrote:

> On Fri, Aug 23, 2019 at 2:30 PM Michael Andersen <mich...@steelcode.com>
> wrote:
> >
> > Are you suggesting that there might be enough unpreemtable goroutines to
> fill all the P's? I do have several cgo goroutines sitting in syscalls, but
> my understanding was that the scheduler would "preempt" that by moving the
> M off the P and creating a new M.
>
> That is how it is supposed to work, yes.  That process can take up to
> 10ms, but that doesn't explain what you are seeing.
>
> Ian
>

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/CAPLpPrsqcBd7A6dPeMO0B19hhnWbanCauEwaYihFuB%3D7c7d0QA%40mail.gmail.com.

Reply via email to