[jira] [Commented] (IMPALA-4555) Don't cancel query for failed ReportExecStatus (done=false) RPC

Philip Zeyliger (JIRA) Mon, 10 Dec 2018 13:24:03 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-4555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16715567#comment-16715567
 ]


Philip Zeyliger commented on IMPALA-4555:
-----------------------------------------

I ran into {{ReportExecStatus request on impala.ControlService from 
172.26.26.40:59030 dropped due to backpressure. The service queue is full; it 
has 2147483647 items.}} recently, which this would have helped with.

If we're saying that {{ReportExecStatus(done=false)}} is less important than 
{{ReportExecStatus(done=true)}}, should we give them separate queues on the 
coordinator side?
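
To make the question concrete, here is a minimal sketch of what separate queues could look like. It is illustrative only and is not Impala's actual ControlService code: the class {{ReportQueues}}, its capacities, and the choice to drain final reports first are all assumptions made for this example.

```python
from collections import deque
from typing import Optional


class ReportQueues:
    """Hypothetical coordinator-side queues, one per report kind."""

    def __init__(self, interim_capacity: int, final_capacity: int) -> None:
        self.interim: deque = deque()   # reports with done=false
        self.final: deque = deque()     # reports with done=true
        self.interim_capacity = interim_capacity
        self.final_capacity = final_capacity

    def offer(self, report_id: str, done: bool) -> bool:
        """Enqueue a report; return False if it is dropped due to backpressure."""
        if done:
            queue, capacity = self.final, self.final_capacity
        else:
            queue, capacity = self.interim, self.interim_capacity
        if len(queue) >= capacity:
            return False
        queue.append(report_id)
        return True

    def poll(self) -> Optional[str]:
        """Dequeue the next report, serving final reports first (an assumption)."""
        if self.final:
            return self.final.popleft()
        if self.interim:
            return self.interim.popleft()
        return None
```

With this shape, a full interim queue would only drop {{done=false}} reports, and a {{done=true}} report would still be accepted as long as its own queue has room.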

> Don't cancel query for failed ReportExecStatus (done=false) RPC
> ---------------------------------------------------------------
>
>                 Key: IMPALA-4555
>                 URL: https://issues.apache.org/jira/browse/IMPALA-4555
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Distributed Exec
>    Affects Versions: Impala 2.7.0
>            Reporter: Sailesh Mukil
>            Assignee: Thomas Tauber-Marshall
>            Priority: Major
>
> We currently try to send the ReportExecStatus RPC up to 3 times, retrying 
> when an attempt fails (for example, due to high network load or a network 
> partition). If all 3 attempts fail, we cancel the fragment instance and 
> hence the query.
> However, we do not need to cancel the fragment instance if sending the report 
> with _done=false_ failed. We can just skip this turn and try again the next 
> time.
> We could probably skip sending the report up to 2 times (if we're unable to 
> send due to high network load and if done=false) before falling back to the 
> current behavior, which is to cancel the fragment instance. The point is to 
> try again later, when the network load may be lower, rather than to retry 
> right away: the load is less likely to have dropped after 100 ms than 
> after 5 s.
> Also, we probably do not need to have the retry logic unless we've already 
> skipped twice or if done=true.
> This could help reduce the network load on the coordinator for highly 
> concurrent workloads.
> The only drawback I see now is that the QueryExecSummary might be stale for a 
> while (which it would have been anyway, because the RPCs would have failed to 
> send).
> P.S.: The solution proposed above may need to change if we go ahead with 
> IMPALA-2990.
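
The policy proposed in the description can be sketched roughly as follows. This is a sketch under stated assumptions, not the Impala implementation: {{send_report}}, {{ReportState}}, and the {{send_rpc}} callback are hypothetical names, and resetting the skip count after a successful send is an assumption the description does not spell out.

```python
from dataclasses import dataclass
from typing import Callable

MAX_SKIPS = 2      # interim reports that may be skipped before falling back
MAX_ATTEMPTS = 3   # attempts per report once retrying is enabled


@dataclass
class ReportState:
    skipped: int = 0  # consecutive interim reports skipped so far


def send_report(state: ReportState, done: bool,
                send_rpc: Callable[[], bool]) -> str:
    """Try to send one status report; return "sent", "skipped", or "cancel"."""
    # Retry only for the final report, or once the skip budget is used up.
    retry = done or state.skipped >= MAX_SKIPS
    attempts = MAX_ATTEMPTS if retry else 1
    for _ in range(attempts):
        if send_rpc():
            state.skipped = 0  # assumption: a success resets the skip budget
            return "sent"
    if retry:
        # Current behavior: give up and cancel the fragment instance.
        return "cancel"
    # done=false and budget remaining: skip this turn, try at the next interval.
    state.skipped += 1
    return "skipped"
```

Under sustained failure with done=false, this sketch makes one attempt on each of the first two turns and only falls back to the three-attempt retry and cancel on the third.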



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
