fishy commented on pull request #2295:
URL: https://github.com/apache/thrift/pull/2295#issuecomment-746667382


   @dcelasun so if you look at the data race output (note that for stdlib the 
line numbers are for [go1.10.8](https://github.com/golang/go/tree/go1.10.8)):
   
   ```
   ==================
   WARNING: DATA RACE
   Read at 0x00c42030e1d8 by goroutine 14:
     bytes.(*Buffer).Read()
         /usr/local/go/src/bytes/buffer.go:70 +0x80
     io/ioutil.(*nopCloser).Read()
         <autogenerated>:1 +0x7d
     net/http.transferBodyReader.Read()
         /usr/local/go/src/net/http/transfer.go:60 +0x75
     io.(*LimitedReader).Read()
         /usr/local/go/src/io/io.go:446 +0xca
     bufio.(*Writer).ReadFrom()
         /usr/local/go/src/bufio/bufio.go:703 +0x13b
     io.copyBuffer()
         /usr/local/go/src/io/io.go:386 +0x417
     io.Copy()
         /usr/local/go/src/io/io.go:362 +0x74
     net/http.(*transferWriter).WriteBody()
         /usr/local/go/src/net/http/transfer.go:337 +0x98a
     net/http.(*Request).write()
         /usr/local/go/src/net/http/request.go:622 +0x797
     net/http.(*persistConn).writeLoop()
         /usr/local/go/src/net/http/transport.go:1825 +0x2e0
   
   Previous write at 0x00c42030e1d8 by goroutine 31:
     _/thrift/src/lib/go/thrift.(*THttpClient).Flush.func1()
         /usr/local/go/src/bytes/buffer.go:100 +0xaf
     _/thrift/src/lib/go/thrift.(*THttpClient).Flush()
         /thrift/src/lib/go/thrift/http_client.go:218 +0x4cd
     _/thrift/src/lib/go/thrift.TestHTTPClientFlushesRequestBufferOnErrors()
         /thrift/src/lib/go/thrift/http_client_test.go:130 +0x303
     testing.tRunner()
         /usr/local/go/src/testing/testing.go:777 +0x16d
   
   Goroutine 14 (running) created at:
     net/http.(*Transport).dialConn()
         /usr/local/go/src/net/http/transport.go:1238 +0xb4a
     net/http.(*Transport).getConn.func4()
         /usr/local/go/src/net/http/transport.go:957 +0x9f
   
   Goroutine 31 (running) created at:
     testing.(*T).Run()
         /usr/local/go/src/testing/testing.go:824 +0x564
     testing.runTests.func1()
         /usr/local/go/src/testing/testing.go:1063 +0xa4
     testing.tRunner()
         /usr/local/go/src/testing/testing.go:777 +0x16d
     testing.runTests()
         /usr/local/go/src/testing/testing.go:1061 +0x4e1
     testing.(*M).Run()
         /usr/local/go/src/testing/testing.go:978 +0x2cd
     main.main()
         _testmain.go:332 +0x22a
   ==================
   ```
   
   It's that goroutine 31 (the one running the test) called the deferred 
`Reset` on the buffer inside `Flush` function, while goroutine 14 is trying to 
read from the buffer.
   
   The thing is that unlike most other `TTransport` implementations that all 
the I/O happened on the same goroutine, it's not the case for the http clients. 
If you follow the line numbers from the call stack, you'll notice that the http 
client actually spawned 2 goroutines for read and write (goroutine 14 is the 
one [for 
write](https://github.com/golang/go/blob/b0cb374daf646454998bac7b393f3236a2ab6aca/src/net/http/transport.go#L1238),
 as in it reads from the input (buffer) and writes them to the wire).
   
   So what happened is:
   
   1. The "main" goroutine (the one from the test) inside http client found out 
that the context is already canceled, *after* it already spawned the goroutines 
for the read and write loops. It returns the error after the context 
cancellation check (and triggered the deferred `Reset`)
   2. In the mean time, the write loop goroutine haven't done the cancellation 
check yet, so it still tries to read from the buffer.
   
   It's flaky on travis likely due to the resource limitation on the 
environment, that the main goroutine runs a lot slower, so when it detects 
context cancellation the write loop already did the check. I can almost 
reproduce it 100% on my local machine with go1.10.8 because I have more cores 
and the race condition is much more likely to happen.
   
   After go1.10.8 there're likely some change in stdlib http client to do the 
cancellation check earlier (I haven't dig into the details yet), but it doesn't 
mean it's safe to call `Reset` after we handed over the buffer to http client. 
Inside http client they have locks and stuff to guarantee that two goroutines 
don't try to access the buffer at the same time, but outside of that (in 
thrift's `THttpClient.Flush`) we are not part of that coordination. We should 
just give up the ownership of the buffer instead.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to