[jira] [Commented] (IMPALA-9199) Add support for single query retries on cluster membership changes

ASF subversion and git services (Jira) Fri, 15 May 2020 13:12:27 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-9199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108619#comment-17108619
 ]


ASF subversion and git services commented on IMPALA-9199:
---------------------------------------------------------

Commit bd4d01a379fa483baf7524c0effb981c2fb0c742 in impala's branch 
refs/heads/master from Sahil Takiar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=bd4d01a ]

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) a
query fails and, as a result, blacklists a node. Either event is
considered a cluster membership change because it affects which nodes a
query will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.
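
The two triggers above can be sketched as a single predicate. This is only
an illustration, not Impala's code: QueryAttempt, should_retry, and all
field names are hypothetical, and the "retry at most once" check is taken
from the issue description rather than from the commit itself.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class QueryAttempt:
    """Hypothetical record of one running query attempt."""
    query_id: str
    scheduled_nodes: Set[str]
    retry_enabled: bool = False    # mirrors 'retry_failed_queries' (off by default)
    already_retried: bool = False  # the issue only covers retrying a query once


def should_retry(attempt, removed_nodes, blacklisted_nodes):
    """Return True if a cluster membership change should trigger a retry.

    removed_nodes: nodes dropped by a statestore update (trigger 1).
    blacklisted_nodes: nodes blacklisted after a query failure (trigger 2).
    """
    if not attempt.retry_enabled or attempt.already_retried:
        return False
    changed = set(removed_nodes) | set(blacklisted_nodes)
    # Only a change that touches a node this attempt was scheduled on matters.
    return bool(attempt.scheduled_nodes & changed)
```

Treating both events as one set of "changed" nodes reflects the point made
above: either one alters which nodes the query can be scheduled on.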

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modifications to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState is either NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver also owns the TExecRequest for the query; it is
  re-used for each query retry
* QueryDrivers and ClientRequestStates are now referenced using a
  QueryHandle
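
The ownership model and retry sequence described in this list can be
sketched roughly as follows. This is a simplified Python model, not the
actual C++ implementation: the method names, the query id scheme, and the
exact points at which RetryState changes are assumptions made for
illustration.

```python
import itertools
from enum import Enum


class RetryState(Enum):
    NOT_RETRIED = "NOT_RETRIED"
    RETRYING = "RETRYING"
    RETRIED = "RETRIED"


class ClientRequestState:
    """One attempt of a query (simplified, hypothetical model)."""

    def __init__(self, query_id, exec_request):
        self.query_id = query_id
        self.exec_request = exec_request
        self.retry_state = RetryState.NOT_RETRIED
        self.started = self.cancelled = self.closed = False

    def start(self):
        self.started = True

    def cancel(self):
        self.cancelled = True

    def close(self):
        self.closed = True


class QueryDriver:
    """Owns the exec request and every attempt made for one query."""

    _ids = itertools.count(1)  # hypothetical id generator

    def __init__(self, exec_request):
        self.exec_request = exec_request  # planned once, re-used on retry
        self.attempts = []
        self._new_attempt().start()

    def _new_attempt(self):
        attempt = ClientRequestState(f"query-{next(self._ids)}", self.exec_request)
        self.attempts.append(attempt)
        return attempt

    def retry(self):
        original = self.attempts[-1]
        original.retry_state = RetryState.RETRYING  # assumed transition point
        original.cancel()               # 1. cancel the original query
        retried = self._new_attempt()   # 2. create and register the new query
        retried.start()                 # 3. start it
        original.close()                # 4. only then close the original
        original.retry_state = RetryState.RETRIED   # assumed transition point
        return retried
```

Because the driver hands the same exec_request object to every attempt, a
retry in this model never goes back through parsing or planning, which
matches the "Core Features" list above.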

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
    * "Retry Status: RETRIED"
    * "Retry Cause: [the error that triggered the retry]"
    * "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (i.e. the second attempt of the
  query) will include:
    * "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state
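
As a rough illustration of how a script might pick these fields out of a
text runtime profile, consider the sketch below. It assumes each field
appears on its own line in "Key: value" form, which is not confirmed by
this commit message; extract_retry_info is a hypothetical helper, not part
of Impala.

```python
# Field names as listed in the commit message above.
RETRY_FIELDS = (
    "Retry Status",
    "Retry Cause",
    "Retried Query Id",
    "Original Query Id",
)


def extract_retry_info(profile_text):
    """Collect retry-related fields from a text profile.

    Assumes one "Key: value" pair per line; other lines are ignored.
    """
    info = {}
    for line in profile_text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep and key in RETRY_FIELDS:
            info[key] = value
    return info
```

Under that assumption, the profile of a failed first attempt would yield
the "Retry Status", "Retry Cause", and "Retried Query Id" keys, while the
profile of the second attempt would yield only "Original Query Id".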

Testing:
* Added E2E tests in test_query_retries.py; looped tests for a few days
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC-{H,DS} workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true; there
  were no unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue
* Ran core tests and query retry stress test against an ASAN build
* Ran concurrent_select.py to stress query cancellation
* Ran be/ tests against a TSAN build

Limitations:
* Several limitations apply; they are listed in the parent JIRA

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Reviewed-on: http://gerrit.cloudera.org:8080/14824
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Sahil Takiar <[email protected]>


> Add support for single query retries on cluster membership changes
> ------------------------------------------------------------------
>
>                 Key: IMPALA-9199
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9199
>             Project: IMPALA
>          Issue Type: Sub-task
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>
> If the cluster membership changes (either because the statestore detects that 
> a node has left the cluster, or a node is added to the blacklist), then 
> rather than cancelling / failing queries running on the target node, retry 
> them.
> This JIRA focuses on just retrying queries once.
> There should be a query level option to control whether queries are retried 
> or not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

