[jira] [Resolved] (IMPALA-2638) Retry queries that fail during scheduling

Sahil Takiar (Jira) Fri, 15 May 2020 13:25:28 -0700


     [ https://issues.apache.org/jira/browse/IMPALA-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Sahil Takiar resolved IMPALA-2638.
----------------------------------
    Resolution: Duplicate

Closing as a duplicate because this use case is handled by Node Blacklisting 
(IMPALA-9299) and Transparent Query Retries (IMPALA-9124).

> Retry queries that fail during scheduling
> -----------------------------------------
>
>                 Key: IMPALA-2638
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2638
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Distributed Exec
>    Affects Versions: Impala 2.3.0
>            Reporter: Henry Robinson
>            Assignee: Sahil Takiar
>            Priority: Minor
>              Labels: scalability
>
> An important building block for node-decommissioning is the ability to retry 
> queries if they fail during scheduling for some recoverable reason (e.g. RPC 
> failed due to unreachable host, fragment could not be started due to memory 
> pressure). 
> To do this we can detect failures during {{Coordinator::Exec()}}, cancel the 
> running query, and then restart from somewhere in 
> {{QueryExecState::ExecQueryOrDmlRequest()}}, updating a local blacklist of 
> nodes so that we know to avoid those that have caused failures.
> There are some subtleties though:
> * Queries shouldn't be retried more than a small number of times, in case 
> they *cause* the outage (there might be a good way to detect that when the 
> failure occurs)
> * If the query is restarted from the scheduling step (rather than completely 
> restarting), some care will have to be taken to ensure that none of the old 
> query's fragments that are being cancelled can affect the new query's 
> operation in any way (there are several ways to do this). 
> Eventually the failures will propagate to the rest of the cluster via the 
> statestore. In the meantime, the retry mechanism allows queries to recover 
> and continue while the statestore is still detecting the failure. 
> This JIRA doesn't address restarting queries that have suffered failures 
> part-way through execution, because that's strictly harder and not (as) 
> needed for decommissioning.
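
The retry loop described in the quoted issue could be sketched roughly as
follows. This is only an illustration under stated assumptions, not Impala's
implementation (the use case was ultimately handled by IMPALA-9299 and
IMPALA-9124). All names here (AttemptResult, StartFn, RetryOutcome,
RunWithRetries) are hypothetical and do not correspond to Impala classes. The
sketch bounds the number of attempts, keeps a local blacklist of nodes blamed
for recoverable failures, and passes an attempt id to each attempt so that a
caller could tag fragments and ignore reports from a cancelled attempt.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical outcome of one scheduling attempt (not an Impala type).
struct AttemptResult {
  bool ok = false;
  bool recoverable = false;  // e.g. RPC to unreachable host, memory pressure
  std::string failed_node;   // node blamed for the failure, if known
};

// Hypothetical callback that schedules and starts fragments on `nodes`.
// `attempt_id` lets the callee tag fragments so that stale reports from a
// cancelled earlier attempt can be told apart from the current one.
using StartFn = std::function<AttemptResult(
    const std::vector<std::string>& nodes, int attempt_id)>;

struct RetryOutcome {
  bool ok = false;
  int attempts = 0;
  std::set<std::string> blacklist;  // local to this query, not cluster-wide
};

// Tries to start the query at most `max_attempts` times, skipping nodes
// that caused a recoverable failure in an earlier attempt.
RetryOutcome RunWithRetries(const std::vector<std::string>& cluster,
                            const StartFn& start, int max_attempts) {
  assert(max_attempts > 0);
  RetryOutcome out;
  for (int attempt_id = 0; attempt_id < max_attempts; ++attempt_id) {
    // Build the candidate set, excluding locally blacklisted nodes.
    std::vector<std::string> candidates;
    for (const auto& node : cluster) {
      if (out.blacklist.count(node) == 0) candidates.push_back(node);
    }
    if (candidates.empty()) break;  // nothing left to schedule on

    ++out.attempts;
    AttemptResult result = start(candidates, attempt_id);
    if (result.ok) {
      out.ok = true;
      return out;
    }
    if (!result.recoverable) break;  // do not retry unrecoverable errors
    if (!result.failed_node.empty()) out.blacklist.insert(result.failed_node);
  }
  return out;
}
```

Keeping the blacklist inside the per-query outcome mirrors the "local
blacklist" wording in the issue: it only affects the retried query and does
not wait for the statestore to propagate the failure.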



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
