[
https://issues.apache.org/jira/browse/TINKERPOP-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stephen Mallette reopened TINKERPOP-2769:
-----------------------------------------
> gremlin-server does not reply with a timeout response to all timed out
> requests
> -------------------------------------------------------------------------------
>
> Key: TINKERPOP-2769
> URL: https://issues.apache.org/jira/browse/TINKERPOP-2769
> Project: TinkerPop
> Issue Type: Bug
> Affects Versions: 3.6.0, 3.5.3, 3.6.1, 3.5.4
> Reporter: Roi Martin
> Assignee: Stephen Mallette
> Priority: Major
> Fix For: 3.7.0, 3.6.2, 3.5.5
>
>
> The following reproducer hangs because not all timed out requests are
> reported by gremlin-server.
> *main.go*
> {code:go}
> package main
> import (
> "log"
> "sync"
> "sync/atomic"
> gremlingo "github.com/apache/tinkerpop/gremlin-go/v3/driver"
> )
> const (
> gremlinServerEndpoint = "ws://localhost:8182/gremlin"
> nWorkers = 250
> )
> func main() {
> conn, err := gremlingo.NewDriverRemoteConnection(gremlinServerEndpoint,
> func(settings *gremlingo.DriverRemoteConnectionSettings) {
> settings.LogVerbosity = gremlingo.Off
> })
> if err != nil {
> log.Fatalf("could not create remote connection: %v", err)
> }
> defer conn.Close()
> g := gremlingo.Traversal_().WithRemote(conn)
> var (
> wg sync.WaitGroup
> nPendingRequests int32
> )
> for i := 0; i < nWorkers; i++ {
> wg.Add(1)
> go func(id int) {
> defer wg.Done()
> atomic.AddInt32(&nPendingRequests, 1)
> result, err := g.
> Inject(1).
>
> SideEffect(&gremlingo.Lambda{"Thread.sleep(5_000)", "gremlin-groovy"}).
> Next()
> npr := atomic.AddInt32(&nPendingRequests, -1)
> log.Printf("[%v] result=%v err=%v nPendingRequests=%v",
> id, result, err, npr)
> }(i)
> }
> log.Print("waiting")
> wg.Wait()
> log.Print("done")
> }
> {code}
> Steps to reproduce the issue:
> 1. Run gremlin-server:
> {noformat}
> docker run --rm -ti -p 8182:8182 tinkerpop/gremlin-server:3.5.4-SNAPSHOT 2>&1
> | tee log
> {noformat}
> 2. Execute the reproducer:
> {noformat}
> go run main.go
> {noformat}
> 3. Wait ~30s for the reproducer to hang.
> All these tests has been performed against the default gremlin-server
> configuration. It is important to note that in this machine the gremlin pool
> is set to 8. Higher values may require to adjust the number of workers in the
> reproducer to trigger the issue.
> After following these steps, the output of the reproducer should look like
> this:
> {noformat}
> 2022/06/29 16:08:47 waiting
> 2022/06/29 16:08:52 [22] result=result{object=1 class=int64} err=<nil>
> nPendingRequests=249
> ...
> 2022/06/29 16:09:17 [154] result=result{object=1 class=int64} err=<nil>
> nPendingRequests=208
> 2022/06/29 16:09:17 [243] result=result{object=1 class=int64} err=<nil>
> nPendingRequests=207
> 2022/06/29 16:09:17 [226] result=<nil> err=E0903: there are no results left
> nPendingRequests=206
> 2022/06/29 16:09:17 [164] result=<nil> err=E0903: there are no results left
> nPendingRequests=204
> ...
> 2022/06/29 16:09:17 [131] result=<nil> err=E0903: there are no results left
> nPendingRequests=113
> {noformat}
> On the other side, the gremlin-server logs should look like this:
> {noformat}
> [INFO] ? - Channel started at port 8182.
> [WARN] ? - A timeout occurred during traversal evaluation of
> [RequestMessage{, requestId=9d7ccc61-0e7e-4105-ae35-eb81a3895170,
> op='bytecode', processor='traversal', args={gremlin=[[], [inject(1),
> sideEffect(lambda[Thread.sleep(5_000)])]], aliases={g=g}}}] - consider
> increasing the limit given to evaluationTimeout
> [WARN] ? - A timeout occurred during traversal evaluation of
> [RequestMessage{, requestId=6f2d939a-31c9-48fd-bb54-b2bfa1fbd4de,
> op='bytecode', processor='traversal', args={gremlin=[[], [inject(1),
> sideEffect(lambda[Thread.sleep(5_000)])]], aliases={g=g}}}] - consider
> increasing the limit given to evaluationTimeout
> ...
> {noformat}
> In this specific case,
> - 43 requests finished successfully (250-207 = 43)
> - 94 requests timed out (207-113 = 94)
> - 113 requests never finished (making the reproducer hang)
> If we check the gremlin-server logs, the number of reported timeouts matches
> the output of the reproducer, which makes me think that gremlin-server is the
> culprit.
> {noformat}
> $ grep 'A timeout occurred' log | wc -l
> 94
> {noformat}
> Inspecting the hanging goroutines in the reproducer shows that they are IO
> waiting:
> {noformat}
> (dlv) grs
> ...
> Goroutine 373 - User: /home/n/goroot/src/net/fd_posix.go:55
> net.(*netFD).Read (0x684273) [IO wait]
> ...
> (dlv) gr 373
> Switched from 0 to 373 (thread 709858)
> (dlv) bt
> 0 0x000000000043d212 in runtime.gopark
> at /home/n/goroot/src/runtime/proc.go:362
> 1 0x0000000000435a4a in runtime.netpollblock
> at /home/n/goroot/src/runtime/netpoll.go:522
> 2 0x0000000000465e05 in internal/poll.runtime_pollWait
> at /home/n/goroot/src/runtime/netpoll.go:302
> 3 0x00000000004ff2c8 in internal/poll.(*pollDesc).wait
> at /home/n/goroot/src/internal/poll/fd_poll_runtime.go:83
> 4 0x00000000004ff377 in internal/poll.(*pollDesc).waitRead
> at /home/n/goroot/src/internal/poll/fd_poll_runtime.go:88
> 5 0x0000000000500f8f in internal/poll.(*FD).Read
> at /home/n/goroot/src/internal/poll/fd_unix.go:167
> 6 0x0000000000684273 in net.(*netFD).Read
> at /home/n/goroot/src/net/fd_posix.go:55
> 7 0x000000000069ce49 in net.(*conn).Read
> at /home/n/goroot/src/net/net.go:183
> 8 0x00000000005d9eb4 in bufio.(*Reader).fill
> at /home/n/goroot/src/bufio/bufio.go:106
> 9 0x00000000005da278 in bufio.(*Reader).Peek
> at /home/n/goroot/src/bufio/bufio.go:144
> 10 0x00000000008298a9 in github.com/gorilla/websocket.(*Conn).read
> at /home/n/go/pkg/mod/github.com/gorilla/[email protected]/conn.go:371
> 11 0x000000000082d5fd in github.com/gorilla/websocket.(*Conn).advanceFrame
> at /home/n/go/pkg/mod/github.com/gorilla/[email protected]/conn.go:809
> 12 0x000000000082f1d1 in github.com/gorilla/websocket.(*Conn).NextReader
> at /home/n/go/pkg/mod/github.com/gorilla/[email protected]/conn.go:1009
> 13 0x000000000082fccb in github.com/gorilla/websocket.(*Conn).ReadMessage
> at /home/n/go/pkg/mod/github.com/gorilla/[email protected]/conn.go:1093
> 14 0x00000000008418fb in
> github.com/apache/tinkerpop/gremlin-go/v3/driver.(*gorillaTransporter).Read
> at
> /home/n/go/pkg/mod/github.com/apache/tinkerpop/gremlin-go/[email protected]/driver/gorillaTransporter.go:116
> 15 0x000000000085a302 in
> github.com/apache/tinkerpop/gremlin-go/v3/driver.(*gremlinServerWSProtocol).readLoop
> at
> /home/n/go/pkg/mod/github.com/apache/tinkerpop/gremlin-go/[email protected]/driver/protocol.go:57
> 16 0x000000000085c2e7 in
> github.com/apache/tinkerpop/gremlin-go/v3/driver.newGremlinServerWSProtocol.func1
> at
> /home/n/go/pkg/mod/github.com/apache/tinkerpop/gremlin-go/[email protected]/driver/protocol.go:196
> 17 0x000000000046a841 in runtime.goexit
> at /home/n/goroot/src/runtime/asm_amd64.s:1571
> {noformat}
> The reason why the go program does not finish is that gremlin-server keeps
> responding to pings, therefore gremlin-go's pong handler keeps extending the
> read deadline of the connection.
> */tinkerpop/gremlin-go/driver/gorillaTransporter.go*
> {code:go}
> transporter.connection.SetPongHandler(func(string) error {
> err := transporter.connection.SetReadDeadline(time.Now().Add(2
> * transporter.connSettings.keepAliveInterval))
> if err != nil {
> return err
> }
> return nil
> })
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)