Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12297 )
Change subject: IMPALA-8138: Reintroduce rpc debugging options ...................................................................... Patch Set 1: (6 comments) Thanks for working on this patch. My general comment is to consider moving away from the existing "fake" fault injection framework in Thrift and use debug actions to simulate scenarios in which we may actually fail to exercise the entire RPC stack better. http://gerrit.cloudera.org:8080/#/c/12297/5/be/src/rpc/impala-control-service-proxy.h File be/src/rpc/impala-control-service-proxy.h: http://gerrit.cloudera.org:8080/#/c/12297/5/be/src/rpc/impala-control-service-proxy.h@44 PS5, Line 44: As mentioned elsewhere, this kind of artificial fault injection doesn't seem to be too useful. http://gerrit.cloudera.org:8080/#/c/12297/3/be/src/rpc/impala-control-service-proxy.h File be/src/rpc/impala-control-service-proxy.h: http://gerrit.cloudera.org:8080/#/c/12297/3/be/src/rpc/impala-control-service-proxy.h@43 PS3, Line 43: Not needed. Same below. http://gerrit.cloudera.org:8080/#/c/12297/5/be/src/rpc/rpc-mgr.inline.h File be/src/rpc/rpc-mgr.inline.h: http://gerrit.cloudera.org:8080/#/c/12297/5/be/src/rpc/rpc-mgr.inline.h@65 PS5, Line 65: : : : : : : : : : : : : It appears that this send vs recv debug actions were carried over from Thrift implementation. Retrospectively, the "fault injection" we did with Thrift was quite hacky (I am the culprit here) and it stemmed from the total lack of fault injection testing back then for exercising the error paths in Thrift RPC. As part of the KRPC development, we invested in proper fault injection testing by truly pausing the Impala and artificially creates various failure scenario. This allows a more extensive exercise across the entire RPC stack instead of just exercising the RPC handlers at the client and server sides. With the fault injection framework, it seems to be not too useful to continue with this path of artificial fault injection via debug action we used to do with Thrift. Instead, we may want to rethink the fault injection testing with KRPC. In particular, it may exercise the code better by doing some of the followings: - use debug actions to inject random delays in the RPC handlers. This is particularly useful for RPCs with timeout - use debug actions to randomly reject some of the incoming RPCs in ImpalaServicePool - use debug actions to respond with error status in the RPC handlers. The errors will be specific to each RPC handler (e.g. deserialization error of Thrift profiles, deserialization errors of RowBatch) - debug action to force some incoming RPCs to use deferred queue in KrpcDataStreamRecvr - (experimental / dangerous) "randomly" corrupt the incoming RPC payloads in ImpalaServicePool. - inject delay in RPC callback in the client side to simulate an overloaded client The above are some examples I can think of right now. For other failures, we may need to rely on the fault injection framework: - use iptables to drop all incoming packets to the RPC port from a particular host. This simulates a host which was powered off or network partitions - Restart remote Impalad will trigger the behavior of broken connections (by sending a RST packet) - Send SIGSTOP to remote Impalad (which we already do in the fault injection framework) to simulate non-responsive Impalad - other ideas... In general, my suggestion is to use debug actions to simulate failure which can actually happen instead of using this artificial fault injection which seems a bit meaningless at this point. http://gerrit.cloudera.org:8080/#/c/12297/1/be/src/runtime/query-state.h File be/src/runtime/query-state.h: http://gerrit.cloudera.org:8080/#/c/12297/1/be/src/runtime/query-state.h@314 PS1, Line 314: proxy_ > I assume that if I change the class name back to something ending in "Proxy Yup. http://gerrit.cloudera.org:8080/#/c/12297/5/be/src/service/control-service.cc File be/src/service/control-service.cc: http://gerrit.cloudera.org:8080/#/c/12297/5/be/src/service/control-service.cc@171 PS5, Line 171: if (qs.get() == nullptr) { : Status status(ErrorMsg(TErrorCode::INTERNAL_ERROR, Should this be converted to debug action ? http://gerrit.cloudera.org:8080/#/c/12297/5/be/src/service/control-service.cc@186 PS5, Line 186: : Should this be converted to debug action ? -- To view, visit http://gerrit.cloudera.org:8080/12297 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2046cb9dadf846ea90c04e95781b2bbde3325941 Gerrit-Change-Number: 12297 Gerrit-PatchSet: 1 Gerrit-Owner: Thomas Marshall <[email protected]> Gerrit-Reviewer: Andrew Sherman <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Thomas Marshall <[email protected]> Gerrit-Comment-Date: Tue, 16 Apr 2019 17:47:03 +0000 Gerrit-HasComments: Yes
