Hey Henry,

This is great. The rules are easy to read and easy to set, and they make
failure injection much easier, so we could add tests to verify Impala's
behavior. I love that you can change the Impala configuration dynamically
without restarting the cluster. Not only does that save testing time, it
also adds flexibility to tests.
I have a few questions:
1. Could I set multiple debug rules for the same scope? For example, random
RPC failures on multiple RPCs: first a transmit-data RPC fails, then a
cancel-fragment RPC fails as well.
2. How easy is it to reproduce a test failure? Some actions inject failures
into fragments at random. If we could repeat the test with exactly the same
failure-injection sequence, we could probably figure out what's wrong
faster.
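
For question 2, driving the random choices from an explicit seed is one
common way to make injection repeatable: the same seed replays the same
failure sequence. A rough sketch of the idea in Python (RandomFail and Seq
are names I made up for illustration, not part of your API):

```python
import random

class RandomFail:
    """Fire with some probability; a fixed seed makes the run replayable."""
    def __init__(self, p, seed):
        self.p = p
        self.rng = random.Random(seed)  # private RNG, isolated from global state

    def fire(self):
        # True means "inject the failure" on this invocation.
        return self.rng.random() < self.p

class Seq:
    """Compose actions: each call delegates to the next action in turn."""
    def __init__(self, *actions):
        self.actions = list(actions)
        self.i = 0

    def fire(self):
        action = self.actions[self.i % len(self.actions)]
        self.i += 1
        return action.fire()

# Two runs built with the same seeds make identical injection decisions,
# so a failing randomized test can be replayed exactly.
run1 = Seq(RandomFail(0.5, seed=7), RandomFail(0.5, seed=7))
run2 = Seq(RandomFail(0.5, seed=7), RandomFail(0.5, seed=7))
assert [run1.fire() for _ in range(20)] == [run2.fire() for _ in range(20)]
```

If the randomized tests logged the seed they used, a failure could be
re-run deterministically just by passing the same seed back in.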

Thanks,
Juan

On Fri, Sep 2, 2016 at 4:51 PM, Henry Robinson <[email protected]> wrote:

> Hi all -
>
> I want to start to get some feedback on a skunkworks project I did
> recently.
>
> Impala has long had 'debug options' which are small scripts that can inject
> a limited class of failures into query execution, mostly at the exec node
> scope (e.g. delaying Prepare(), returning an error during Open() etc).
> These are useful, but have a number of limitations:
>
> * Extending them is hard, because the option language has no regular
> structure.
> * The set of available actions is limited to waiting forever, or
> immediately cancelling.
> * Actions can't be composed or sequenced, or executed with some
> probability.
> * Actions are only applicable to exec nodes, and generalising them is hard.
> * Only one debug action may be applied per-query.
>
> I have wanted to make improvements here for a long time, not least because
> testing any RPC changes is hard, and am about done with the first cut that
> addresses all of the above issues and more. I'm calling them 'debug rules',
> and they have the following features:
>
>    1. Regular rule structure so that the language may be extended without
>    touching a parser.
>    2. JSON-based serialisation so that rules may be human readable and
>    (somewhat) writeable.
>    3. Python builder classes to make complex rule construction easy.
>    4. Extensible set of actions, which now includes setting a CLI flag at
>    runtime, logging a message, failing with a probability, sleeping, and
>    setting RPC timeouts.
>    5. Actions may have arguments which may themselves be functions.
>    Functions may access their environment in a limited way (so, for
> example,
>    they may log their fragment ID).
>    6. Actions may be applied at any point in Impalad, not just in exec
>    nodes.
>    7. Disabled completely in release builds.
>
>
> The full commit is the HEAD here:
> https://github.com/henryr/impala/tree/debug-rules. But I'd like some
> general feedback on the 'user experience', and with that in mind I'd like
> to point you to four locations:
>
>    1. Introductory documentation here:
>    https://github.com/henryr/Impala/blob/debug-rules/be/src/runtime/debug-rules.h#L25
>    2. A set of tests for failure modes after IMPALA-1599, which
>    parallelised fragment start-up:
>    https://github.com/henryr/Impala/blob/debug-rules/tests/rpc/test_1599.py
>    3. A randomized test that introduces random delays to a subset of phases
>    in PHJ, PHA, scans and xchg operators:
>    https://github.com/henryr/Impala/blob/debug-rules/tests/rpc/test_exec_node_injection.py
>    4. A port of the existing test_rpc_timeouts.py test (which confirms that
>    the RPC timeout mechanism Juan wrote works correctly) that doesn't need
>    to be a custom cluster test - debug rules do all the parameter and flag
>    setting:
>    https://github.com/henryr/Impala/blob/debug-rules/tests/rpc/test_timeouts.py
>
> I like the last one particularly as it shaves more than 60s off the
> execution time of that test.
>
> The changes themselves are very large (~2000 lines and growing), and aren't
> remotely ready for review yet (you'll see half-commented code all over the
> place) - but I'd like feedback on whether this is useful, would help with
> any testing scenarios you haven't been able to address yet, and if the
> abstractions and test APIs seem about right.
>
> Let me know what you think -
>
> Henry
>
