[ https://issues.apache.org/jira/browse/IMPALA-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sahil Takiar resolved IMPALA-2638.
----------------------------------
Resolution: Duplicate
Closing as a duplicate because this use case is handled by Node Blacklisting
(IMPALA-9299) and Transparent Query Retries (IMPALA-9124).
> Retry queries that fail during scheduling
> -----------------------------------------
>
> Key: IMPALA-2638
> URL: https://issues.apache.org/jira/browse/IMPALA-2638
> Project: IMPALA
> Issue Type: Improvement
> Components: Distributed Exec
> Affects Versions: Impala 2.3.0
> Reporter: Henry Robinson
> Assignee: Sahil Takiar
> Priority: Minor
> Labels: scalability
>
> An important building block for node-decommissioning is the ability to retry
> queries if they fail during scheduling for some recoverable reason (e.g. RPC
> failed due to unreachable host, fragment could not be started due to memory
> pressure).
> To do this we can detect failures during {{Coordinator::Exec()}}, cancel the
> running query and then re-start from somewhere in
> {{QueryExecState::ExecQueryOrDmlRequest()}} - updating a local blacklist of
> nodes so that we know to avoid those that have caused failures.
> There are some subtleties though:
> * Queries shouldn't be retried more than a small number of times, in case
> they *cause* the outage (there may be a good way to detect that at the time
> of the failure)
> * If the query is restarted from the scheduling step (rather than completely
> restarting), some care will have to be taken to ensure that none of the old
> query's fragments that are being cancelled can affect the new query's
> operation in any way (there are several ways to do this).
> Eventually the failures will propagate to the rest of the cluster via the
> statestore - this mechanism allows queries to recover and continue while the
> statestore detects the failure.
> This JIRA doesn't address restarting queries that have suffered failures
> part-way through execution, because that's strictly harder and less
> necessary for decommissioning.
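
The retry scheme in the quoted description (a bounded number of retries, a
per-query blacklist of failed hosts, and isolation from the cancelled
attempt's fragments) can be sketched roughly as below. This is a minimal
illustration in Python, not Impala's code: RecoverableSchedulingError,
run_with_retries, start_fragments, accept_report and the max_retries default
are all hypothetical names and values. Tagging each attempt with a fresh ID is
only one of the "several ways" to keep old fragments from affecting the new
attempt.

```python
class RecoverableSchedulingError(Exception):
    """Hypothetical error: scheduling failed in a way a retry might fix."""

    def __init__(self, host, reason):
        super().__init__(f"{host}: {reason}")
        self.host = host  # the host blamed for the failure


def run_with_retries(hosts, start_fragments, max_retries=2):
    """Schedule a query, avoiding hosts that failed on earlier attempts.

    start_fragments(attempt_id, candidates) is a hypothetical callback that
    raises RecoverableSchedulingError when a host cannot start its fragment.
    Returns (attempt_id, hosts_used) on success and re-raises the last error
    once the retry budget is spent.
    """
    blacklist = set()  # local to this query, not shared cluster-wide
    # One initial attempt plus at most max_retries retries.
    for attempt_id in range(1, max_retries + 2):
        candidates = [h for h in hosts if h not in blacklist]
        if not candidates:
            raise RuntimeError("no hosts left to schedule on")
        try:
            start_fragments(attempt_id, candidates)
            return attempt_id, candidates
        except RecoverableSchedulingError as err:
            blacklist.add(err.host)
            if attempt_id > max_retries:
                raise  # give up: the query itself may be causing the failures


def accept_report(current_attempt_id, report_attempt_id):
    """Ignore status reports sent by fragments of a cancelled attempt."""
    return report_attempt_id == current_attempt_id
```

With hosts ["host-a", "host-b", "host-c"] and a callback that fails whenever
"host-b" is a candidate, the first attempt fails, "host-b" is blacklisted, and
the second attempt runs on the remaining two hosts. Any late report still
carrying attempt ID 1 is then rejected by accept_report.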