This is an automated email from the ASF dual-hosted git repository.

awong pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git

commit 04e584c62e52f8d196ddbba93783007d2fc02a01
Author: Will Berkeley <[email protected]>
AuthorDate: Thu Mar 14 17:16:33 2019 -0700

    Increase timeout in tls_socket-test
    
    Very rarely (~3/2000 times in TSAN with 8 stress threads),
    tls_socket-test will fail with a log like the following:
    
    I0314 19:20:54.118880   236 tls_socket-test.cc:109] server: negotiation complete
    I0314 19:20:54.119151   223 tls_socket-test.cc:109] client: negotiation complete
    I0314 19:21:04.127199   236 tls_socket-test.cc:165] server echoing 33406976 bytes
    /data/6/wdberkeley/kudu/src/kudu/security/tls_socket-test.cc:234: Failure
    Failed
    Bad status: Network error: BlockingRecv error: failed to read from TLS socket (remote: unknown): Connection reset by peer (error 104)
    
    It seems the following is happening:
    
    1. The client and the echo server connect successfully.
    2. The client sends its payload of 32MiB (33554432 bytes) in
       BlockingWrite.
    3. The server, while looping in BlockingRecv to receive the payload,
       fails to read the whole payload before timing out, through some
       combination of resource saturation, unfavorable scheduling, and
       EINTR returns from recv. Notice the 10-second delay between the
       second and third log messages (the timeout is 10s) and that the
       number of bytes echoed (33406976) is less than 32MiB.
    4. The server terminates the connection because of the timeout, but this
       does not result in a failure on its side because the server was
       stopped by the client.
    5. The client fails when it first tries to BlockingRecv from the
       closed connection, instead of on the second BlockingRecv as the test
       intends.
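
The sequence above can be sketched in isolation. The following is a minimal illustration, not Kudu's actual BlockingRecv: the names RecvUntilDeadline, RecvResult, read_some, and now_ms are hypothetical, and the clock and the socket are passed in as callbacks so that the behavior is deterministic. It shows how a receive loop with a deadline can return with only part of the payload when each read makes slow progress.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical result type: how much was read and whether the deadline hit.
struct RecvResult {
  std::size_t nread;  // bytes received before returning
  bool timed_out;     // true if the deadline passed before 'want' bytes arrived
};

// Hypothetical stand-in for a blocking receive loop with a deadline.
// 'read_some(n)' models one recv call asking for up to n bytes; returning 0
// models a call that made no progress (for example, an EINTR-style retry).
// 'now_ms()' models a monotonic clock in milliseconds.
RecvResult RecvUntilDeadline(std::size_t want, std::int64_t deadline_ms,
                             const std::function<std::size_t(std::size_t)>& read_some,
                             const std::function<std::int64_t()>& now_ms) {
  assert(read_some && now_ms);
  std::size_t nread = 0;
  while (nread < want) {
    // Give up with a partial count once the deadline has passed.
    if (now_ms() >= deadline_ms) {
      return {nread, true};
    }
    const std::size_t remaining = want - nread;
    // Never count more than was asked for, even if the callback misbehaves.
    nread += std::min(read_some(remaining), remaining);
  }
  return {nread, false};
}
```

With a simulated reader that delivers 1000 bytes per millisecond tick and a 20000-byte payload, a deadline of 10 ms returns a partial count with timed_out set, while a deadline of 30 ms reads the whole payload. That mirrors the effect of raising the timeout: the loop is unchanged, it simply has more time to finish under load.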
    
    This seems like a test-only issue: the timeout on the server side
    seems like reasonable behavior. Since it's so rare, tripling the timeout
    should hopefully make the issue stop or at least make it much, much
    rarer. With a 10s timeout, 2000 runs on TSAN, and 8 stress threads, I saw
    2-4 failures. With a 30s timeout, I see 0.
    
    Change-Id: Ibc615ea8f03a74f38b2bd6f3b4c140b3e435d4f3
    Reviewed-on: http://gerrit.cloudera.org:8080/12761
    Reviewed-by: Alexey Serbin <[email protected]>
    Tested-by: Kudu Jenkins
    Reviewed-by: Adar Dembo <[email protected]>
---
 src/kudu/security/tls_socket-test.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/kudu/security/tls_socket-test.cc b/src/kudu/security/tls_socket-test.cc
index f609ce3..b88cdf4 100644
--- a/src/kudu/security/tls_socket-test.cc
+++ b/src/kudu/security/tls_socket-test.cc
@@ -58,7 +58,7 @@ using std::vector;
 namespace kudu {
 namespace security {
 
-const MonoDelta kTimeout = MonoDelta::FromSeconds(10);
+const MonoDelta kTimeout = MonoDelta::FromSeconds(30);
 
 // Size is big enough to not fit into output socket buffer of default size
 // (controlled by setsockopt() with SO_SNDBUF).
