[jira] [Commented] (IMPALA-9124) Transparently retry queries that fail due to cluster membership changes

ASF subversion and git services (Jira) Fri, 15 May 2020 13:12:27 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-9124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108620#comment-17108620
 ]


ASF subversion and git services commented on IMPALA-9124:
---------------------------------------------------------

Commit bd4d01a379fa483baf7524c0effb981c2fb0c742 in impala's branch 
refs/heads/master from Sahil Takiar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=bd4d01a ]

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry
* QueryDrivers and ClientRequestStates are now referenced using a
  QueryHandle

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
    * "Retry Status: RETRIED"
    * "Retry Cause: [the error that triggered the retry]"
    * "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
    * "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py; looped tests for a few days
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC-{H,DS} workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue
* Ran core tests and query retry stress test against an ASAN build
* Ran concurrent_select.py to stress query cancellation
* Ran be/ tests against a TSAN build

Limitations:
* There are several limitations that are listed out in the parent JIRA

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Reviewed-on: http://gerrit.cloudera.org:8080/14824
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Sahil Takiar <[email protected]>


> Transparently retry queries that fail due to cluster membership changes
> -----------------------------------------------------------------------
>
>                 Key: IMPALA-9124
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9124
>             Project: IMPALA
>          Issue Type: New Feature
>          Components: Backend, Clients
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Critical
>         Attachments: Impala Transparent Query Retries.pdf
>
>
> Currently, if the Impala Coordinator or any Executors run into errors during 
> query execution, Impala will fail the entire query. It would improve user 
> experience to transparently retry the query for some transient, recoverable 
> errors.
> This JIRA focuses on retrying queries that would otherwise fail due to 
> cluster membership changes. Specifically, node failures that cause changes in 
> the cluster membership (currently the Coordinator cancels all queries running 
> on a node if it detects that the node is no longer part of the cluster) and 
> node blacklisting (the Coordinator blacklists a node because it detects a 
> problem with that node - can’t execute RPCs against the node). It is not 
> focused on retrying general errors (e.g. any frontend errors, 
> MemLimitExceeded exceptions, etc.).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (IMPALA-9124) Transparently retry queries that fail due to cluster membership changes

Reply via email to