Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12672 )

Change subject: IMPALA-8143: Enhance DoRpcWithRetry().
......................................................................


Patch Set 2:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.h
File be/src/rpc/rpc-mgr-test.h:

http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.h@296
PS5, Line 296: ERR
typo


http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc
File be/src/rpc/rpc-mgr-test.cc:

http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@214
PS5, Line 214:   ASSERT_OK(rpc_mgr_.RegisterService(10, 10, scan_mem_impl,
             :       static_cast<
Not sure if it's output of clang-tidy or something but I find it a bit hard to 
ready to split it like this. Can't we keep these 2 lines unchanged ?


http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@263
PS5, Line 263: ueue f
typo


http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@291
PS5, Line 291:
nullptr;

Also do you want to assert that it's zero ?


http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@298
PS5, Line 298: // Configure the service
not used ?


http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@316
PS5, Line 316: ASSERT_OK(ping_
Instead of sharing "sleep_ms" between RPCs, it may be cleaner to pass the sleep 
time as RPC argument instead.


http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@322
PS5, Line 322: // A lambda t
proxy->


http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@333
PS5, Line 333:
             :   // wait for async call to be running
             :   sleep_started.Wait();
             :   ASSERT_EQ(0, get_current_queue_size(rpc_mgr_));
             :
             :   // Do a second async call that will fill up the queue
             :   RpcController async_controller2;
             :   CountingBarrier async2_done(1);
             :   // Reset the sleep time
             :   sleep_ms = 100;
             :   // Set the deadline to something reasonable otherwise the 
pings from DoRpcWithRetry
             :   // will replace this async call in the queue.
             :   MonoTime deadline = MonoTime::Now() + 
MonoDelta::FromSeconds(10);
             :   async_controller2.set_deadline(deadline);
             :   ResponseCallback async2_callback = [&]() { 
async2_done.Notify(); };
             :   proxy.get()->PingAsync(request, &response, &async_controller2, 
async2_callback);
             :
             :   // Wait for second async to get queued
             :   SleepForMs(300);
             :   // Check the queue size
             :   ASSERT_EQ(1, get_current_queue_size(
These set of tests seem rather timing dependent. For instance, it may occur 
that the first time DoRpcWithRetry() is called will succeed because the async 
call somehow finishes by then due to some weird scheduling order. May be better 
to rethink a better way to test it.


http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@378
PS5, Line 378:   ASSERT_FALSE(async2_done.pending());
             :
             :   // Test injection of failures with DebugAction.
             :   
query_ctx.client_request.query_options.__set_debug_action("DoRpcWithRetry:FAIL");
             :   Status inject_status = RpcMgr::DoRpcWithRetry(ping_rpc, 
query_ctx, "ping failed", 1,
             :       MonoDelta::FromSeconds(100), 0, "DoRpcWithRetry");
             :   ASSERT_FALSE(inject_status.ok());
             :   EXPECT_ERROR(inject_status, TErrorCo
Please see comments in rpc-mgr.inline.h that we probably should only retry if 
the remote server is busy. For all other errors, we probably should report it 
right away to callers. We can consider transparently handling other types of 
errors in the future.


http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.h
File be/src/rpc/rpc-mgr.h:

http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.h@178
PS5, Line 178:   void ToJson(rapidjson::Document* document);
             :
             :   /// Retry the Rpc 'rpc_call' up to 'times_to_try' times.
             :   /// Each Rpc has a timeout of 'timeout'.
             :   /// If the service is busy then sleep 'se
Not needed as discussed in person.


http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.h@189
PS5, Line 189: the R
typename


http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.h@191
PS5, Line 191: n the 'l
It's important to highlight that this only works iff rpc_call is idempotent in 
the comments above. Any non-idempotent rpc_call using this interface may have 
undesired consequence.


http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.h@193
PS5, Line 193: ebugAction() to potentially inject errors.
             :   template <typename F, typename L>
It seems inconsistent to pass the timeout in as MonoDelta while passing the 
backoff time in second. To be consistent with the rest of the code in Impala, 
it may be better to pass the timeout as int64_t instead (e.g int64_t 
timeout_ms, int64_t backoff_ms)

In addition, the backoff's unit is second, which may be too coarse grained. I 
would suggest using millisecond instead.


http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.h@201
PS5, Line 201: ) before destroying RpcMgr";
             :   }
             :
             :  private:
             :   /// One pool per registered service. sc
Although this function is inlined, It has grown over time to take too many 
arguments. It may benefit from some refactoring if possible.

One simplification is to considering hiding the query_ctx and debug action 
together into a simple lambda function which gets invoked inside 
DoRpcWithRetry().


http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.inline.h
File be/src/rpc/rpc-mgr.inline.h:

http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.inline.h@57
PS5, Line 57: times_to_try, const Mono
May make sense to group this argument next to query_ctx as they are related.


http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.inline.h@61
PS5, Line 61: , d
++i


http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.inline.h@72
PS5, Line 72:
Do we need to reset response between retry ?


http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.inline.h@76
PS5, Line 76:       // Check for injected failures.
            :       rpc_
Based on our offline discussion, this is not needed. Please consider removing 
it.


http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.inline.h@79
PS5, Line 79:     }
            :
            :     const kudu::Status& k_status = rpc_call(&rpc_controller);
            :     rpc_status = FromKuduStatus(k_status, error_msg);
            :     i
So, it's not clear to me whether we should retry for errors other than 
RpcMgr::IsServerTooBusy(). In other words, if we hit other types of fatal 
errors, should we report them back to the caller right away ?

Another possible way to treat it is that the code will retry regardless of the 
error types, assuming that any kind of error was due to some transient errors 
which may go away. In which case, it may make sense to backoff regardless of 
the error types, right ?

What's your thought ?



--
To view, visit http://gerrit.cloudera.org:8080/12672
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia9693151c35e02235665b3c285a48c585973d390
Gerrit-Change-Number: 12672
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Sherman <asher...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <asher...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tmarsh...@cloudera.com>
Gerrit-Comment-Date: Thu, 28 Mar 2019 00:15:45 +0000
Gerrit-HasComments: Yes

Reply via email to