[
https://issues.apache.org/jira/browse/IMPALA-9124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108620#comment-17108620
]
ASF subversion and git services commented on IMPALA-9124:
---------------------------------------------------------
Commit bd4d01a379fa483baf7524c0effb981c2fb0c742 in impala's branch
refs/heads/master from Sahil Takiar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=bd4d01a ]
IMPALA-9199: Add support for single query retries on cluster membership changes
Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).
Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) a
query fails and, as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.
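
A minimal sketch of the trigger decision described above, written in Python for brevity rather than in the backend's own language. The names `FailureEvent`, `QueryOptions`, and `should_retry` are hypothetical and do not come from the Impala code base; the `already_retried` guard is an assumption based on the word "single" in the commit title.

```python
from dataclasses import dataclass
from enum import Enum, auto


class FailureEvent(Enum):
    """Hypothetical classification of why a query attempt failed."""
    NODE_REMOVED_BY_STATESTORE = auto()  # trigger (1) in the text above
    NODE_BLACKLISTED = auto()            # trigger (2) in the text above
    OTHER_ERROR = auto()                 # anything else; not retried


@dataclass
class QueryOptions:
    # Mirrors the 'retry_failed_queries' option, which is off by default.
    retry_failed_queries: bool = False


# Both events count as a cluster membership change.
MEMBERSHIP_CHANGES = frozenset(
    {FailureEvent.NODE_REMOVED_BY_STATESTORE, FailureEvent.NODE_BLACKLISTED}
)


def should_retry(event, options, already_retried=False):
    """Return True if the failed attempt qualifies for a retry."""
    if not options.retry_failed_queries:
        return False
    if already_retried:  # assumption: at most one retry per query
        return False
    return event in MEMBERSHIP_CHANGES
```

Under these assumptions, a blacklisting failure with the option enabled returns True, while the same failure with default options returns False.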
A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.
Core Features:
* Retries are transparent to the user; no modifications to client
libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
default
Implementation:
* When a query is retried, the original query is cancelled, the new
query is created, registered, and started, and then the original query
is closed
* A new layer of abstraction between the ImpalaServer and
ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
ClientRequestState is in one of NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well; it is
re-used for each query retry
* QueryDrivers and ClientRequestStates are now referenced using a
QueryHandle
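
The ownership model and retry sequence listed above can be sketched roughly as follows. This is an illustrative Python model, not the actual C++ implementation: the class and state names follow the commit message, but every method, attribute, the query id format, and the exact points at which RetryState changes are assumptions.

```python
import uuid
from enum import Enum


class RetryState(Enum):
    NOT_RETRIED = "NOT_RETRIED"
    RETRYING = "RETRYING"
    RETRIED = "RETRIED"


class ClientRequestState:
    """A single attempt of a query, with its own query id."""

    def __init__(self, exec_request):
        self.query_id = uuid.uuid4().hex  # hypothetical id format
        self.exec_request = exec_request
        self.retry_state = RetryState.NOT_RETRIED
        self.started = False
        self.cancelled = False
        self.closed = False

    def start(self):
        self.started = True

    def cancel(self):
        self.cancelled = True

    def close(self):
        self.closed = True


class QueryDriver:
    """Owns the exec request and every attempt made for one query."""

    def __init__(self, exec_request):
        self.exec_request = exec_request  # re-used for each retry
        self.attempts = [ClientRequestState(exec_request)]
        self.attempts[0].start()

    @property
    def current(self):
        return self.attempts[-1]

    def retry(self):
        original = self.current
        original.retry_state = RetryState.RETRYING  # assumed transition point
        original.cancel()                           # 1. cancel the original
        retried = ClientRequestState(self.exec_request)  # 2. create, no re-planning
        self.attempts.append(retried)               # 3. register
        retried.start()                             # 4. start
        original.close()                            # 5. close the original
        original.retry_state = RetryState.RETRIED   # assumed transition point
        return retried
```

In this model a caller holds only the `QueryDriver`, loosely standing in for the QueryHandle, so `driver.current` keeps pointing at the live attempt after a retry.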
Observability:
* Users can tell if a query is retried using runtime profiles and the
Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
  * "Retry Status: RETRIED"
  * "Retry Cause: [the error that triggered the retry]"
  * "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (i.e. the second attempt of the
query) will include:
  * "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
"RETRIED" state
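
As a rough illustration of how the profile fields above could be consumed, the following Python sketch pulls the retry-related lines out of a text profile. The field names come from the commit message; the helper `extract_retry_info`, the surrounding profile layout, and the sample values are made up for the example.

```python
# Field names as listed in the commit message.
RETRY_FIELDS = (
    "Retry Status",
    "Retry Cause",
    "Retried Query Id",
    "Original Query Id",
)


def extract_retry_info(profile_text):
    """Return a dict of the retry-related "Key: value" lines in a profile."""
    info = {}
    for line in profile_text.splitlines():
        # Split on the first colon only, since values may contain colons.
        key, sep, value = line.strip().partition(":")
        if sep and key.strip() in RETRY_FIELDS:
            info[key.strip()] = value.strip()
    return info


# Hypothetical profile excerpt; real profiles contain far more content.
sample_profile = """\
Query (id=aaaa:bbbb):
  Retry Status: RETRIED
  Retry Cause: example error: node unreachable
  Retried Query Id: cccc:dddd
"""

retry_info = extract_retry_info(sample_profile)
```

With this sample, `retry_info["Retried Query Id"]` holds the id of the second attempt, which could then be used to look up that attempt's own profile.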
Testing:
* Added E2E tests in test_query_retries.py; looped tests for a few days
* Added a stress test query_retries_stress_runner.py that runs concurrent
streams of a TPC-{H,DS} workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
impala-shell and Hue
* Ran core tests and query retry stress test against an ASAN build
* Ran concurrent_select.py to stress query cancellation
* Ran be/ tests against a TSAN build
Limitations:
* Several limitations remain; they are listed in the parent JIRA
Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Reviewed-on: http://gerrit.cloudera.org:8080/14824
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Sahil Takiar <[email protected]>
> Transparently retry queries that fail due to cluster membership changes
> -----------------------------------------------------------------------
>
> Key: IMPALA-9124
> URL: https://issues.apache.org/jira/browse/IMPALA-9124
> Project: IMPALA
> Issue Type: New Feature
> Components: Backend, Clients
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Priority: Critical
> Attachments: Impala Transparent Query Retries.pdf
>
>
> Currently, if the Impala Coordinator or any Executors run into errors during
> query execution, Impala will fail the entire query. It would improve user
> experience to transparently retry the query for some transient, recoverable
> errors.
> This JIRA focuses on retrying queries that would otherwise fail due to
> cluster membership changes. Specifically, node failures that cause changes in
> the cluster membership (currently the Coordinator cancels all queries running
> on a node if it detects that the node is no longer part of the cluster) and
> node blacklisting (the Coordinator blacklists a node because it detects a
> problem with that node - can’t execute RPCs against the node). It is not
> focused on retrying general errors (e.g. any frontend errors,
> MemLimitExceeded exceptions, etc.).