On Thu, Jan 31, 2019 at 3:11 PM Thomas Tauber-Marshall <[email protected]> wrote:
> I'm an Impala dev working on replacing Thrift with krpc. One issue that
> recently came up is that we would like to have a simple way of simulating
> different types of failures of rpcs for testing purposes, and I was
> wondering if krpc already has anything like this built in, or if there's
> any interest in such a feature being implemented.
>
> In the past with Thrift, Impala did this by overriding automatically
> generated rpc functions to add debugging calls. I have a patch out
> currently to start doing this with the rpcs that we've ported to krpc so
> far: https://gerrit.cloudera.org/#/c/12297/
>
> That patch would allow tests to be written that pass in options in the form
> "${RPC_NAME}:${ERROR_TYPE}@ARGS....", for example "CANCEL_QUERY:[email protected]",
> which would cause CancelQuery rpcs to fail with 50% probability.
>
> It was pointed out in the review that this could potentially be
> accomplished more cleanly by modifying the code that generates the proxy
> definitions, e.g. protoc-gen-krpc.cc. We could always just make those
> modifications in the copy of krpc's code that is checked in to Impala, but
> we'd like to minimize divergence, and of course it's always nice to share
> code/effort where possible.
>

I'm not against adding the ability to hook the proxy classes, so long as
it's perf-neutral when not enabled. I would think you'd want the ability to
fault an RPC both before it gets sent (so it is never delivered) and also to
block the response (so the server does process it but the client doesn't
realize). It looks like your patch did that. That would help suss out cases
where you have retries without ensuring proper idempotency, etc.

Another option would be to put the changes in the generic 'Proxy' class --
or make it possible to pass your own Proxy subclass instance when
constructing a generated proxy. I think that's cleaner than modifying the
codegen with hooks.

That said, I don't think we'd make much use of it on the Kudu side.
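To make the two fault points above concrete, here is a rough, self-contained sketch. It is not krpc code: FaultPoint, FakeServer, FaultInjectingProxy and RetryOnceAfterLostResponse are all hypothetical names made up for illustration, and the "server" is just an in-process counter. The only cost when no hook is installed is one check for an empty std::function.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// All names below are hypothetical; this is not krpc's real API.
enum class FaultPoint { kNone, kBeforeSend, kAfterResponse };

struct Status {
  bool ok;
  std::string msg;
};

// Stand-in for a remote server whose handler is NOT idempotent.
struct FakeServer {
  int counter = 0;
  Status Increment() {
    ++counter;
    return {true, ""};
  }
};

// Stand-in for a proxy with an optional fault-injection hook.
class FaultInjectingProxy {
 public:
  using Hook = std::function<FaultPoint(const std::string& rpc_name)>;

  explicit FaultInjectingProxy(FakeServer* server) : server_(server) {}

  void set_hook(Hook hook) { hook_ = std::move(hook); }

  Status Call(const std::string& rpc_name) {
    // With no hook installed, the only overhead is this emptiness check.
    FaultPoint fault = hook_ ? hook_(rpc_name) : FaultPoint::kNone;
    if (fault == FaultPoint::kBeforeSend) {
      // The request is never delivered; the server sees nothing.
      return {false, "injected: request dropped before send"};
    }
    Status s = server_->Increment();
    if (fault == FaultPoint::kAfterResponse) {
      // The server processed the call, but the client is told it failed.
      return {false, "injected: response dropped"};
    }
    return s;
  }

 private:
  FakeServer* server_;
  Hook hook_;
};

// A naive client that retries after a lost response. Returns the
// server-side counter after the retry succeeds.
int RetryOnceAfterLostResponse() {
  FakeServer server;
  FaultInjectingProxy proxy(&server);
  int calls = 0;
  proxy.set_hook([&calls](const std::string&) {
    return calls++ == 0 ? FaultPoint::kAfterResponse : FaultPoint::kNone;
  });
  Status first = proxy.Call("Increment");
  assert(!first.ok);  // client believes the first attempt failed
  Status retry = proxy.Call("Increment");
  assert(retry.ok);
  return server.counter;
}
```

In this sketch the dropped-response case leaves the server counter at 2 after a single logical increment, which is the kind of missing-idempotency bug the second fault point is meant to expose. The before-send case would leave it at 0.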
We do have a few places where we do fault injection like the above, but more
often our fault injection works by starting multiple processes and actually
controlling the forked daemons by signals, or otherwise making them crash by
remotely setting fault injection "crash_on_..." type flags. These kinds of
faults are a bit more realistic, since after a node crashes it will have to
restart, go back to initial states, etc. It also ensures that we get
correlated-in-time failures across all different RPCs headed for the host,
which can trigger interesting behavior on clients that might have multiple
outstanding requests to the crashed one.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera
