Right, then it looks less likely that we are blocked on a mutex.

Every 2 minutes we spin up a goroutine which in turn spins up a bunch of
goroutines to collect data from BigQuery & Elastic (data collector
routines).
The 2-minute goroutine then waits on a waitgroup for the data collector
routines to come back, so it's not a single goroutine on our side at least.
From the backtrace we got from the SIGABRT we see only one such data
collector goroutine blocked, and the other two 2-minute goroutines waiting
on the waitgroup.
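
Roughly the pattern looks like the sketch below (simplified, with
placeholder names; the per-cycle context timeout is only something we are
considering adding, it is not in the current code):

package main

import (
	"context"
	"log"
	"sync"
	"time"
)

type collector func(context.Context) error

// prefetchLoop is the "every 2 minutes" goroutine.
func prefetchLoop(ctx context.Context, collectors []collector) {
	ticker := time.NewTicker(2 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			go collectOnce(ctx, collectors)
		}
	}
}

// collectOnce fans out the data collector goroutines and waits on a
// waitgroup, which is where the blocked goroutines in our backtrace sit.
func collectOnce(ctx context.Context, collectors []collector) {
	// Bound the whole cycle so one stuck source cannot pin the waitgroup
	// forever; this only helps if the collectors honor ctx.
	ctx, cancel := context.WithTimeout(ctx, 90*time.Second)
	defer cancel()

	var wg sync.WaitGroup
	for _, c := range collectors {
		wg.Add(1)
		go func(c collector) {
			defer wg.Done()
			if err := c(ctx); err != nil {
				log.Printf("collector failed: %v", err)
			}
		}(c)
	}
	wg.Wait()
}

func main() {
	// Stand-ins for the real BigQuery and Elastic collectors.
	prefetchLoop(context.Background(), []collector{
		func(ctx context.Context) error { return nil },
		func(ctx context.Context) error { return nil },
	})
}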

I see similar issues reported on the internet. Kubernetes hit the following
issue, which has exactly the same signature as ours.

https://github.com/kubernetes/kubernetes/issues/53191

I will follow up with the BigQuery folks to see what their behavior is: if
one of the requests gets stuck on BigQuery, what do they do with the other
requests?
But in that case our backtrace should show all the other requests waiting
in their code, which is not the case.
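
In the meantime, one thing we can try on our side is to bound each BigQuery
call with a context timeout so a single stuck request fails instead of
pinning the collector. A rough sketch, assuming the standard
cloud.google.com/go/bigquery client (package name, SQL and the timeout
value are placeholders):

package stats // placeholder package name

import (
	"context"
	"fmt"
	"time"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

// runQueryWithTimeout runs one BigQuery query with an upper bound on how
// long we are willing to wait for it.
func runQueryWithTimeout(ctx context.Context, client *bigquery.Client, sql string) ([][]bigquery.Value, error) {
	ctx, cancel := context.WithTimeout(ctx, 60*time.Second) // value is a guess
	defer cancel()

	it, err := client.Query(sql).Read(ctx)
	if err != nil {
		return nil, fmt.Errorf("bigquery read: %w", err)
	}
	var rows [][]bigquery.Value
	for {
		var row []bigquery.Value
		err := it.Next(&row)
		if err == iterator.Done {
			break
		}
		if err != nil {
			return nil, fmt.Errorf("bigquery next: %w", err)
		}
		rows = append(rows, row)
	}
	return rows, nil
}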



On Wed, Aug 26, 2020 at 7:44 PM Robert Engels <reng...@ix.netcom.com> wrote:

> If you look at the stack trace, the futex is because it is trying to
> shut down the entire process - killing all of the M’s - probably because
> you sent the SIGABRT.
>
> On Aug 26, 2020, at 9:41 PM, Robert Engels <reng...@ix.netcom.com> wrote:
>
> 
> The big query client may be “single threaded” meaning you may need a pool
> of connections. Not sure - haven’t used the BQ library from Go.
>
> If that’s the case a single slow query will block all of your connections
> (if they are all accessing BQ). Plus there are BQ rate limits - when you
> hit those the connector may block - again blocking all clients.
>
> On Aug 26, 2020, at 8:27 PM, Siddhesh Divekar <siddhesh.dive...@gmail.com>
> wrote:
>
> 
> Robert,
>
> That's where the confusion is.
> From the backtrace we see that two go routines are blocked on the
> waitgroup.
> The go routine on which they are blocked is waiting on big query to return.
>
> On every user request we create a new go routine so they shouldn't get
> blocked because of these two.
>
> However, we do see one goroutine blocked on a futex. Is there a way to
> know where it originated from?
> When does Go use futex under the hood, so that we can investigate from
> that angle?
>
> goroutine 0 [idle]:
> runtime.futex(0x5b607a8, 0x80, 0x0, 0x0, 0x0, 0x7ffc5dd1bd20, 0x437640,
> 0xc003638038, 0x7ffc5dd1bd48, 0x40c78f, ...)
>         /usr/local/go/src/runtime/sys_linux_amd64.s:535 +0x21
> runtime.futexsleep(0x5b607a8, 0x0, 0xffffffffffffffff)
>         /usr/local/go/src/runtime/os_linux.go:44 +0x46
> runtime.notesleep(0x5b607a8)
>         /usr/local/go/src/runtime/lock_futex.go:151 +0x9f
> runtime.stopm()
>         /usr/local/go/src/runtime/proc.go:1928 +0xc0
> runtime.exitsyscall0(0xc003638000)
>         /usr/local/go/src/runtime/proc.go:3140 +0x111
> runtime.mcall(0x0)
>         /usr/local/go/src/runtime/asm_amd64.s:318 +0x5b
>
> On Wed, Aug 26, 2020 at 5:02 PM robert engels <reng...@ix.netcom.com>
> wrote:
>
>> That should allow your server to clean up “dead” clients. Typically you
>> use this in conjunction with a ‘keep alive’ in the protocol.
>>
>> I am doubtful that a bunch of dead clients hanging around would cause a
>> CPU spike. You really don’t have too many Go routines/connections involved
>> (I’ve worked with 1000’s of live connections).
>>
>> I would look at something else… I am guessing your synchronization is
>> incorrect and your threads are blocking, and you have a few that are
>> spinning expecting something to happen that never will (because the source
>> of the event is blocked on the mutex/lock).
>>
>>
>>
>> On Aug 26, 2020, at 10:09 AM, Siddhesh Divekar <
>> siddhesh.dive...@gmail.com> wrote:
>>
>> Robert,
>>
>> I assume we can safely add these timeouts based on what we expect
>> should be a reasonable timeout for our clients?
>>
>> s.ReadTimeout = expTimeOut * time.Second
>> s.WriteTimeout = expTimeOut * time.Second
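>>
>> Something like this is what I have in mind (just a sketch; mux, the
>> timeout values and the cert paths are placeholders from our setup, and
>> ReadHeaderTimeout/IdleTimeout are extras we might also want):
>>
>> srv := &http.Server{
>>         Addr:              ":8443",
>>         Handler:           mux,
>>         ReadHeaderTimeout: 10 * time.Second,  // bound slow clients sending headers
>>         ReadTimeout:       expTimeOut * time.Second,
>>         WriteTimeout:      expTimeOut * time.Second,
>>         IdleTimeout:       120 * time.Second, // reap idle keep-alive connections
>> }
>> log.Fatal(srv.ListenAndServeTLS(certFile, keyFile))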
>>
>> On Tue, Aug 25, 2020 at 1:14 PM Siddhesh Divekar <
>> siddhesh.dive...@gmail.com> wrote:
>>
>>> Both servers and data sources are in the cloud.
>>> I would not say a lot of data, it's precomputed data which shouldn't
>>> take that long.
>>>
>>>
>>> On Tue, Aug 25, 2020 at 11:25 AM Robert Engels <reng...@ix.netcom.com>
>>> wrote:
>>>
>>>> Are you transferring a lot of data? Are the servers non-cloud hosted?
>>>>
>>>> You could be encountering “tcp stalls”.
>>>>
>>>> On Aug 25, 2020, at 9:24 AM, Siddhesh Divekar <
>>>> siddhesh.dive...@gmail.com> wrote:
>>>>
>>>> 
>>>> Clients are over the internet.
>>>>
>>>> On Tue, Aug 25, 2020 at 3:25 AM Robert Engels <reng...@ix.netcom.com>
>>>> wrote:
>>>>
>>>>> The tcp protocol allows the connection to wait for hours. Go routines
>>>>> stuck in wait do not burn CPU. Are the clients local or remote (over
>>>>> internet)?
>>>>>
>>>>> On Aug 24, 2020, at 10:29 PM, Siddhesh Divekar <
>>>>> siddhesh.dive...@gmail.com> wrote:
>>>>>
>>>>> 
>>>>> Robert,
>>>>>
>>>>> We will do the profiling next time we hit the issue again & see what
>>>>> is happening.
>>>>> This was the first time we saw the issue & don't want to get rid of
>>>>> http2 advantages without making sure it's the actual culprit.
>>>>>
>>>>> Do you think in the meanwhile we should do what the discussion below
>>>>> suggests anyway?
>>>>>
>>>>> https://stackoverflow.com/questions/42238695/goroutine-in-io-wait-state-for-long-time
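>>>>>
>>>>> i.e. something along these lines for our outgoing HTTP clients (a
>>>>> sketch on our side; the timeout values are guesses):
>>>>>
>>>>> dialer := &net.Dialer{Timeout: 30 * time.Second, KeepAlive: 30 * time.Second}
>>>>> client := &http.Client{
>>>>>         Timeout: 2 * time.Minute, // overall cap per request
>>>>>         Transport: &http.Transport{
>>>>>                 DialContext:           dialer.DialContext,
>>>>>                 TLSHandshakeTimeout:   10 * time.Second,
>>>>>                 ResponseHeaderTimeout: 60 * time.Second,
>>>>>                 IdleConnTimeout:       90 * time.Second,
>>>>>                 ExpectContinueTimeout: 1 * time.Second,
>>>>>         },
>>>>> }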
>>>>>
>>>>>
>>>>> On Mon, Aug 24, 2020 at 5:37 PM Robert Engels <reng...@ix.netcom.com>
>>>>> wrote:
>>>>>
>>>>>> I think it is too hard to tell with the limited information. It could
>>>>>> be exhausted connections or it could be thrashing (given the claim of
>>>>>> high CPU).
>>>>>>
>>>>>> I think you want to run a profiling capture prior to hitting the stuck
>>>>>> state - you should be able to detect what is happening.
>>>>>>
>>>>>> If the referenced issue is related I would assume you should be able
>>>>>> to connect by forcing http/1.
>>>>>>
>>>>>> You can also try disabling http/2 and see if your issue goes away.
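>>>>>>
>>>>>> For example (untested sketch, needs net/http and crypto/tls): giving
>>>>>> the transport a non-nil empty TLSNextProto map keeps the client on
>>>>>> HTTP/1, and the same trick on http.Server disables HTTP/2 for TLS
>>>>>> listeners. GODEBUG=http2client=0 / GODEBUG=http2server=0 do the same
>>>>>> without code changes.
>>>>>>
>>>>>> // Client side: never upgrade to HTTP/2.
>>>>>> tr := &http.Transport{
>>>>>>         TLSNextProto: make(map[string]func(authority string, c *tls.Conn) http.RoundTripper),
>>>>>> }
>>>>>> httpClient := &http.Client{Transport: tr}
>>>>>>
>>>>>> // Server side: disable HTTP/2 on TLS listeners.
>>>>>> srv := &http.Server{
>>>>>>         TLSNextProto: make(map[string]func(*http.Server, *tls.Conn, http.Handler)),
>>>>>> }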
>>>>>>
>>>>>> On Aug 24, 2020, at 6:15 PM, Siddhesh Divekar <
>>>>>> siddhesh.dive...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Robert,
>>>>>>
>>>>>> Sorry I missed your earlier response.
>>>>>>
>>>>>> From what we saw, our UI was blocked and since everything was
>>>>>> unresponsive we had to recover the system by sending a SIGABRT.
>>>>>>
>>>>>> On Mon, Aug 24, 2020 at 4:11 PM Siddhesh Divekar <
>>>>>> siddhesh.dive...@gmail.com> wrote:
>>>>>>
>>>>>>> Looking at the number of goroutines we have, does this apply to this
>>>>>>> issue?
>>>>>>> https://github.com/golang/go/issues/27044
>>>>>>>
>>>>>>> On Mon, Aug 24, 2020 at 12:54 PM Robert Engels <
>>>>>>> reng...@ix.netcom.com> wrote:
>>>>>>>
>>>>>>>> Goroutines in a waiting state should not be consuming CPU. Are you
>>>>>>>> certain they are not in constant transition from waiting to
>>>>>>>> processing - this could show up as high CPU usage while everything
>>>>>>>> looks blocked.
>>>>>>>>
>>>>>>>> I would use pprof - github.com/robaho/goanalyzer might be of
>>>>>>>> assistance here to see the actual work being done.
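>>>>>>>>
>>>>>>>> For example, dropping something like this into your main package
>>>>>>>> (sketch; the port is arbitrary) gives you the standard pprof
>>>>>>>> endpoints so you can grab goroutine and CPU profiles while it is
>>>>>>>> wedged:
>>>>>>>>
>>>>>>>> import (
>>>>>>>>         "log"
>>>>>>>>         "net/http"
>>>>>>>>         _ "net/http/pprof" // registers /debug/pprof/* handlers
>>>>>>>> )
>>>>>>>>
>>>>>>>> func init() {
>>>>>>>>         go func() {
>>>>>>>>                 // Separate listener so the debug port stays reachable
>>>>>>>>                 // even if the main server is wedged.
>>>>>>>>                 log.Println(http.ListenAndServe("localhost:6060", nil))
>>>>>>>>         }()
>>>>>>>> }
>>>>>>>>
>>>>>>>> Then curl http://localhost:6060/debug/pprof/goroutine?debug=2 for a
>>>>>>>> full goroutine dump, or run
>>>>>>>> go tool pprof http://localhost:6060/debug/pprof/profile
>>>>>>>> for a CPU profile.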
>>>>>>>>
>>>>>>>> On Aug 24, 2020, at 9:10 AM, Siddhesh Divekar <
>>>>>>>> siddhesh.dive...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> 
>>>>>>>> Hi Ian,
>>>>>>>>
>>>>>>>> Thanks for replying.
>>>>>>>>
>>>>>>>> We have a go server running which handles user requests & collects
>>>>>>>> data from various sources like gcp's cloud sql and big query.
>>>>>>>> We are using shopify's sarama library to do kafka operations.
>>>>>>>>
>>>>>>>> We are seeing lots of goroutines in a waiting state for several
>>>>>>>> minutes.
>>>>>>>> Over time, around 587 goroutines have been spun up.
>>>>>>>>
>>>>>>>> We see that two goroutines are stuck on GCP BigQuery, and we are
>>>>>>>> using waitgroups there.
>>>>>>>> However, it's not clear why that would cause all the other goroutines
>>>>>>>> to hang & make the CPU go high.
>>>>>>>>
>>>>>>>> goroutine 3332131 [semacquire, 79 minutes]:
>>>>>>>> sync.runtime_Semacquire(0xc001c4fcf8)
>>>>>>>>         /usr/local/go/src/runtime/sema.go:56 +0x42
>>>>>>>> sync.(*WaitGroup).Wait(0xc001c4fcf0)
>>>>>>>>         /usr/local/go/src/sync/waitgroup.go:130 +0x64
>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).runParallelQuery(0xc001b54d40, 0xc002912c00, 0x330e1b0, 0xf, 0xc002912cf0, 0x3)
>>>>>>>>         /builds/fusionio/fusion/controller/stats/ins_mgr.go:488 +0x1d7
>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).GetMainUi(0xc001b54d40, 0xc002912db8, 0xc001870e68, 0x746121, 0xc0010fcaf8, 0x17)
>>>>>>>>         /builds/fusionio/fusion/controller/stats/ins_mgr.go:567 +0xa0d
>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*Prefetcher).fetchMainUiTeamInterval(0xc001b56780, 0xc002356810, 0x24, 0x32f7b78, 0x5)
>>>>>>>>         /builds/fusionio/fusion/controller/stats/prefetcher.go:77 +0xf2
>>>>>>>> created by git.fusion.io/fusionio/fusion/controller.git/stats.(*Prefetcher).prefetchStats
>>>>>>>>         /builds/fusionio/fusion/controller/stats/prefetcher.go:100 +0xd8
>>>>>>>>
>>>>>>>> goroutine 3332149 [semacquire, 79 minutes]:
>>>>>>>> sync.runtime_Semacquire(0xc0015ede48)
>>>>>>>>         /usr/local/go/src/runtime/sema.go:56 +0x42
>>>>>>>> sync.(*WaitGroup).Wait(0xc0015ede40)
>>>>>>>>         /usr/local/go/src/sync/waitgroup.go:130 +0x64
>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).runParallelQuery(0xc001b54d40, 0xc00249dc00, 0x330e1b0, 0xf, 0xc00249dcf0, 0x3)
>>>>>>>>         /builds/fusionio/fusion/controller/stats/ins_mgr.go:488 +0x1d7
>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).GetMainUi(0xc001b54d40, 0xc00249ddb8, 0xc003200668, 0xc00407a520, 0xc003200590, 0x46ee97)
>>>>>>>>         /builds/fusionio/fusion/controller/stats/ins_mgr.go:567 +0xa0d
>>>>>>>> git.fusion.io/fusionio/fusion/controller.git/stats.(*Prefetcher).fetchMainUiTeamInterval(0xc001b56780, 0xc002356ba0, 0x24, 0x32f7b78, 0x5)
>>>>>>>>         /builds/fusionio/fusion/controller/stats/prefetcher.go:77 +0xf2
>>>>>>>> created by git.fusion.io/fusionio/fusion/controller.git/stats.(*Prefetcher).prefetchStats
>>>>>>>>         /builds/fusionio/fusion/controller/stats/prefetcher.go:100 +0xd8
>>>>>>>>
>>>>>>>> I found the link below, which somewhat correlates with our scenario.
>>>>>>>>
>>>>>>>> https://stackoverflow.com/questions/42238695/goroutine-in-io-wait-state-for-long-time
>>>>>>>>
>>>>>>>> Most of the goroutines in the backtrace are in the net/http package,
>>>>>>>> so our suspicion is that the above bug in our code might be causing it.
>>>>>>>> Even the BigQuery call is getting hung in net/http.
>>>>>>>>
>>>>>>>> We are using Go version 1.13.8 & are running on a GCP Kubernetes
>>>>>>>> cluster, in an Ubuntu 18.04 Docker image.
>>>>>>>>
>>>>>>>> go env
>>>>>>>> GO111MODULE=""
>>>>>>>> GOARCH="amd64"
>>>>>>>> GOBIN=""
>>>>>>>> GOCACHE="/root/.cache/go-build"
>>>>>>>> GOENV="/root/.config/go/env"
>>>>>>>> GOEXE=""
>>>>>>>> GOFLAGS=""
>>>>>>>> GOHOSTARCH="amd64"
>>>>>>>> GOHOSTOS="linux"
>>>>>>>> GONOPROXY=""
>>>>>>>> GONOSUMDB=""
>>>>>>>> GOOS="linux"
>>>>>>>> GOPATH="/root/go"
>>>>>>>> GOPRIVATE=""
>>>>>>>> GOPROXY="https://proxy.golang.org,direct"
>>>>>>>> GOROOT="/usr/local/go"
>>>>>>>> GOSUMDB="sum.golang.org"
>>>>>>>> GOTMPDIR=""
>>>>>>>> GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
>>>>>>>> GCCGO="gccgo"
>>>>>>>> AR="ar"
>>>>>>>> CC="gcc"
>>>>>>>> CXX="g++"
>>>>>>>> CGO_ENABLED="1"
>>>>>>>> GOMOD="/builds/prosimoio/prosimo/pdash/go.mod"
>>>>>>>> CGO_CFLAGS="-g -O2"
>>>>>>>> CGO_CPPFLAGS=""
>>>>>>>> CGO_CXXFLAGS="-g -O2"
>>>>>>>> CGO_FFLAGS="-g -O2"
>>>>>>>> CGO_LDFLAGS="-g -O2"
>>>>>>>> PKG_CONFIG="pkg-config"
>>>>>>>> GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0
>>>>>>>> -fdebug-prefix-map=/tmp/go-build048009048=/tmp/go-build
>>>>>>>> -gno-record-gcc-switches"
>>>>>>>>
>>>>>>>> Let me know if any other information is needed.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Aug 22, 2020 at 12:30 PM Ian Lance Taylor <i...@golang.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On Sat, Aug 22, 2020 at 12:06 PM Siddhesh Divekar
>>>>>>>>> <siddhesh.dive...@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > We saw an issue with our process running in k8s on ubuntu 18.04.
>>>>>>>>> > Most of the go routines are stuck for several minutes in
>>>>>>>>> http/http2 net code.
>>>>>>>>> >
>>>>>>>>> > Have you seen similar issues ?
>>>>>>>>> >
>>>>>>>>> > goroutine 2800143 [select, 324 minutes]:
>>>>>>>>> > net/http.(*persistConn).readLoop(0xc00187d440)
>>>>>>>>> >         /usr/local/go/src/net/http/transport.go:2032 +0x999
>>>>>>>>> > created by net/http.(*Transport).dialConn
>>>>>>>>> >         /usr/local/go/src/net/http/transport.go:1580 +0xb0d
>>>>>>>>> >
>>>>>>>>> > goroutine 2738894 [IO wait, 352 minutes]:
>>>>>>>>> > internal/poll.runtime_pollWait(0x7f5b61b280c0, 0x72, 0xffffffffffffffff)
>>>>>>>>> >         /usr/local/go/src/runtime/netpoll.go:184 +0x55
>>>>>>>>> > internal/poll.(*pollDesc).wait(0xc0017e7e18, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
>>>>>>>>> >         /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
>>>>>>>>> > internal/poll.(*pollDesc).waitRead(...)
>>>>>>>>> >         /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
>>>>>>>>> > internal/poll.(*FD).Read(0xc0017e7e00, 0xc0044a9000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
>>>>>>>>> >         /usr/local/go/src/internal/poll/fd_unix.go:169 +0x1cf
>>>>>>>>> > net.(*netFD).Read(0xc0017e7e00, 0xc0044a9000, 0x1000, 0x1000, 0xc0026359e8, 0x49d7fd, 0xc0017e7e00)
>>>>>>>>> >         /usr/local/go/src/net/fd_unix.go:202 +0x4f
>>>>>>>>> > net.(*conn).Read(0xc0000db8b8, 0xc0044a9000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
>>>>>>>>> >         /usr/local/go/src/net/net.go:184 +0x68
>>>>>>>>> > net/http.(*connReader).Read(0xc004a4fec0, 0xc0044a9000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
>>>>>>>>> >         /usr/local/go/src/net/http/server.go:785 +0xf4
>>>>>>>>> > bufio.(*Reader).fill(0xc003f1a360)
>>>>>>>>> >         /usr/local/go/src/bufio/bufio.go:100 +0x103
>>>>>>>>> > bufio.(*Reader).Peek(0xc003f1a360, 0x4, 0x0, 0x0, 0x0, 0x0, 0xc002635ad0)
>>>>>>>>> >         /usr/local/go/src/bufio/bufio.go:138 +0x4f
>>>>>>>>> > net/http.(*conn).readRequest(0xc0028e1d60, 0x393ed20, 0xc0024e9780, 0x0, 0x0, 0x0)
>>>>>>>>> >         /usr/local/go/src/net/http/server.go:962 +0xb3b
>>>>>>>>> > net/http.(*conn).serve(0xc0028e1d60, 0x393ed20, 0xc0024e9780)
>>>>>>>>> >         /usr/local/go/src/net/http/server.go:1817 +0x6d4
>>>>>>>>> > created by net/http.(*Server).Serve
>>>>>>>>> >         /usr/local/go/src/net/http/server.go:2928 +0x384
>>>>>>>>> >
>>>>>>>>> > Is there a known issue, or something obvious from the backtrace
>>>>>>>>> > here?
>>>>>>>>>
>>>>>>>>> It's entirely normal for goroutines to sit in pollWait if they are
>>>>>>>>> waiting for network I/O.  There may be reasons why this is incorrect
>>>>>>>>> for your program, but you'll have to tell us those reasons.
>>>>>>>>>
>>>>>>>>> Also, along with those reasons, please tell us the version of Go and
>>>>>>>>> the exact environment that you are running.  Thanks.
>>>>>>>>>
>>>>>>>>> Ian
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> -Siddhesh.
>>>>>>>>
>>>>>>>> <backtrace.txt>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> -Siddhesh.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> -Siddhesh.
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> -Siddhesh.
>>>>>
>>>>>
>>>>
>>>> --
>>>> -Siddhesh.
>>>>
>>>>
>>>>
>>>
>>> --
>>> -Siddhesh.
>>>
>>
>>
>> --
>> -Siddhesh.
>>
>>
>>
>>
>
> --
> -Siddhesh.
>
>

-- 
-Siddhesh.
