[ https://issues.apache.org/jira/browse/IMPALA-9199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108619#comment-17108619 ]
ASF subversion and git services commented on IMPALA-9199:
---------------------------------------------------------
Commit bd4d01a379fa483baf7524c0effb981c2fb0c742 in impala's branch
refs/heads/master from Sahil Takiar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=bd4d01a ]
IMPALA-9199: Add support for single query retries on cluster membership changes
Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).
Query retries are triggered when (1) a node is removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, those queries are retried), or (2) a
query fails and, as a result, blacklists a node. Either event is
considered a cluster membership change because it affects which nodes a
query will be scheduled on. The assumption is that a retry of the query
with the updated cluster membership will succeed.
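
The trigger conditions above can be sketched roughly as follows. This is
an illustrative Python model only, not the actual Impala backend code:
the names MembershipEvent, QueryInfo, and should_retry are hypothetical,
and the "retry at most once" check is an assumption taken from the scope
stated in the JIRA description.

```python
from dataclasses import dataclass
from enum import Enum, auto


class MembershipEvent(Enum):
    """Hypothetical labels for the two triggers named in the commit message."""
    NODE_REMOVED_BY_STATESTORE = auto()
    NODE_BLACKLISTED_AFTER_FAILURE = auto()


@dataclass
class QueryInfo:
    """Hypothetical, minimal view of a running query."""
    query_id: str
    retry_failed_queries: bool   # the query option; off by default
    already_retried: bool        # assumption: a query is retried at most once
    scheduled_nodes: frozenset   # nodes the query was scheduled on


def should_retry(query, event, affected_node):
    """Return True if the event should cause a transparent retry."""
    if not query.retry_failed_queries:
        return False             # retries are opt-in
    if query.already_retried:
        return False             # only a single retry is attempted
    if not isinstance(event, MembershipEvent):
        return False             # not a cluster membership change
    # Only queries that were running on the affected node are retried.
    return affected_node in query.scheduled_nodes
```

Under these assumptions, a query scheduled on node-2 with the option
enabled is retried when node-2 leaves the cluster, while the same query
with the option left at its default is not.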
A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.
Core Features:
* Retries are transparent to the user; no modifications to client
libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
default
Implementation:
* When a query is retried, the original query is cancelled, the new
query is created, registered, and started, and then the original query
is closed
* A new layer of abstraction between the ImpalaServer and
ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
ClientRequestState is in one of three states: NOT_RETRIED, RETRYING, or
RETRIED
* The QueryDriver also owns the TExecRequest for the query; it is
re-used for each query retry
* QueryDrivers and ClientRequestStates are now referenced using a
QueryHandle
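
The ownership model and retry sequence described in this list can be
sketched as below. The real classes live in the C++ backend; this Python
version is a simplified illustration. The class names and RetryState
values come from the commit message, but every attribute and method name,
the id generator callback, and the exact points at which the state moves
to RETRYING and RETRIED are assumptions.

```python
from enum import Enum


class RetryState(Enum):
    NOT_RETRIED = "NOT_RETRIED"
    RETRYING = "RETRYING"
    RETRIED = "RETRIED"


class ClientRequestState:
    """One attempt of a query (fields here are illustrative)."""

    def __init__(self, query_id, exec_request):
        self.query_id = query_id
        self.exec_request = exec_request
        self.retry_state = RetryState.NOT_RETRIED
        self.started = False
        self.cancelled = False
        self.closed = False


class QueryDriver:
    """Owns the exec request and every attempt made for one query."""

    def __init__(self, exec_request, new_query_id):
        self.exec_request = exec_request    # planned once, shared by attempts
        self._new_query_id = new_query_id   # hypothetical id generator
        first = ClientRequestState(new_query_id(), exec_request)
        first.started = True
        self.attempts = [first]

    def retry(self):
        original = self.attempts[-1]
        original.retry_state = RetryState.RETRYING
        original.cancelled = True           # 1. cancel the original query
        retried = ClientRequestState(       # 2. create a brand new query
            self._new_query_id(), self.exec_request)
        self.attempts.append(retried)       # 3. register it with the driver
        retried.started = True              # 4. start it
        original.closed = True              # 5. close the original query
        original.retry_state = RetryState.RETRIED
        return retried
```

Because the retried attempt receives the same exec request object, no
parsing or planning is repeated in this sketch, which mirrors the
"retried queries skip all fe/ work" feature listed above.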
Observability:
* Users can tell if a query is retried using runtime profiles and the
Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
  * "Retry Status: RETRIED"
  * "Retry Cause: [the error that triggered the retry]"
  * "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (i.e. the second attempt of the
query) will include:
  * "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
"RETRIED" state
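
As an illustration of how the profile fields above might be consumed, the
following Python sketch pulls the retry-related lines out of a text
runtime profile. The field names are the ones listed in this message; the
helper name extract_retry_info and the assumption that each field appears
on its own "Key: value" line are hypothetical, so treat this as a sketch
rather than a supported parser.

```python
# Field names as listed in the commit message.
RETRY_FIELDS = (
    "Retry Status",
    "Retry Cause",
    "Retried Query Id",
    "Original Query Id",
)


def extract_retry_info(profile_text):
    """Return a dict of the retry-related fields found in a text profile.

    Assumes each field sits on its own line in "Key: value" form.
    """
    info = {}
    for line in profile_text.splitlines():
        # Split on the first colon only, so values that contain colons
        # (error messages, query ids) are kept intact.
        key, sep, value = line.strip().partition(":")
        if sep and key.strip() in RETRY_FIELDS:
            info[key.strip()] = value.strip()
    return info
```

A profile for the failed attempt would be expected to yield the "Retry
Status", "Retry Cause", and "Retried Query Id" keys, while the profile of
the second attempt would yield only "Original Query Id".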
Testing:
* Added E2E tests in test_query_retries.py; looped tests for a few days
* Added a stress test query_retries_stress_runner.py that runs concurrent
streams of a TPC-{H,DS} workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
unexpected failures
* Ran a 30 GB TPC-DS workload on a 3-node cluster, randomly restarted
impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
impala-shell and Hue
* Ran core tests and query retry stress test against an ASAN build
* Ran concurrent_select.py to stress query cancellation
* Ran be/ tests against a TSAN build
Limitations:
* Several limitations remain; they are listed in the parent JIRA
Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Reviewed-on: http://gerrit.cloudera.org:8080/14824
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Sahil Takiar <[email protected]>
> Add support for single query retries on cluster membership changes
> ------------------------------------------------------------------
>
> Key: IMPALA-9199
> URL: https://issues.apache.org/jira/browse/IMPALA-9199
> Project: IMPALA
> Issue Type: Sub-task
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Priority: Major
>
> If the cluster membership changes (either because the statestore detects that
> a node has left the cluster, or a node is added to the blacklist), then
> rather than cancelling / failing queries running on the target node, retry
> them.
> This JIRA focuses on just retrying queries once.
> There should be a query level option to control whether queries are retried
> or not.