> > I'm an Impala dev working on replacing Thrift with krpc. One issue that
> > recently came up is that we would like to have a simple way of simulating
> > different types of RPC failures for testing purposes, and I was
> > wondering if krpc already has anything like this built in, or if there's
> > any interest in such a feature being implemented.
> >
> > In the past with Thrift, Impala did this by overriding automatically
> > generated RPC functions to add debugging calls. I have a patch out
> > currently to start doing this with the RPCs that we've ported to krpc so
> > far: https://gerrit.cloudera.org/#/c/12297/
> >
> > That patch would allow tests to be written that pass in options of the form
> > "${RPC_NAME}:${ERROR_TYPE}@ARGS....", for example "CANCEL_QUERY:[email protected]",
> > which would cause CancelQuery RPCs to fail with 50% probability.
> >
> > It was pointed out in the review that this could potentially be
> > accomplished more cleanly by modifying the code that generates the proxy
> > definitions, e.g. protoc-gen-krpc.cc. We could always just make those
> > modifications in the copy of krpc's code that is checked in to Impala, but
> > we'd like to minimize divergence, and of course it's always nice to share
> > code/effort where possible.
>
> I'm not against adding the ability to hook the proxy classes, so long as
> it's perf-neutral when not enabled. I would think you'd want the ability to
> fault an RPC both before it gets sent (so it is never delivered) and also
> to block the response (so the server does process it but the client doesn't
> realize). It looks like your patch did that. That would help suss out cases
> where you have retries without ensuring proper idempotency, etc.
>
> Another option would be to put the changes in the generic 'Proxy' class --
> or make it possible to pass your own Proxy subclass instance when
> constructing a generated proxy. I think that's cleaner than modifying the
> codegen with hooks.
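The Proxy-subclass idea above could be sketched roughly as follows. This is a minimal illustration, not krpc's actual API: the `Status`, `FaultSpec`, `ParseFaultFlags`, and `FaultInjectingProxy` names, the simplified `SyncRequest` signature, and the `|` separator between fault specs are all invented here; a real implementation would subclass kudu's `Proxy` and delegate to the underlying request path.

```cpp
#include <cassert>
#include <map>
#include <random>
#include <string>

// Hypothetical stand-in for a Status type returned by RPC calls.
struct Status {
  bool ok;
  std::string msg;
  static Status OK() { return {true, ""}; }
  static Status Fail(std::string m) { return {false, std::move(m)}; }
};

// One parsed fault entry: what kind of error to inject, and how often.
struct FaultSpec {
  std::string error_type;
  double probability = 0.0;
};

// Parse a flag like "CancelQuery:FAIL@0.5|ExecQuery:FAIL@0.1" into a
// per-RPC fault table. Malformed entries are silently skipped.
std::map<std::string, FaultSpec> ParseFaultFlags(const std::string& flag) {
  std::map<std::string, FaultSpec> specs;
  size_t start = 0;
  while (start < flag.size()) {
    size_t end = flag.find('|', start);
    if (end == std::string::npos) end = flag.size();
    const std::string item = flag.substr(start, end - start);
    const size_t colon = item.find(':');
    const size_t at = item.find('@');
    if (colon != std::string::npos && at != std::string::npos && at > colon) {
      FaultSpec spec;
      spec.error_type = item.substr(colon + 1, at - colon - 1);
      spec.probability = std::stod(item.substr(at + 1));
      specs[item.substr(0, colon)] = spec;
    }
    start = end + 1;
  }
  return specs;
}

// A proxy that consults the fault table before each call. A real subclass
// would forward to the generated proxy method when no fault fires.
class FaultInjectingProxy {
 public:
  explicit FaultInjectingProxy(std::map<std::string, FaultSpec> faults)
      : faults_(std::move(faults)), rng_(42) {}

  Status SyncRequest(const std::string& method) {
    auto it = faults_.find(method);
    if (it != faults_.end()) {
      std::uniform_real_distribution<double> dist(0.0, 1.0);
      if (dist(rng_) < it->second.probability) {
        return Status::Fail("injected " + it->second.error_type);
      }
    }
    return Status::OK();  // Would delegate to the real Proxy here.
  }

 private:
  std::map<std::string, FaultSpec> faults_;
  std::mt19937 rng_;
};
```

Keeping the fault table empty by default makes the hook a single map lookup per call, which is one way to stay close to perf-neutral when injection is disabled.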
> That said, I don't think we'd make much use of it on the Kudu side. We do
> have a few places where we do fault injection like the above, but more often
> our fault injection works by starting multiple processes and actually
> controlling the forked daemons by signals, or otherwise making them crash by
> remotely setting fault-injection "crash_on_..." type flags. These kinds of
> faults are a bit more realistic, since after a node crashes it will have to
> restart, go back to its initial state, etc. It also ensures that we get
> correlated-in-time failures across all the different RPCs headed for that
> host, which can trigger interesting behavior on clients that might have
> multiple outstanding requests to the crashed node.
What about simulating network partitions? Do you think we would use what Thomas is describing for that, or something like nftables? I can't tell whether what's being proposed is client-only; if it is, it probably wouldn't be appropriate for simulating network partitions.
