[
https://issues.apache.org/jira/browse/IMPALA-4555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16715567#comment-16715567
]
Philip Zeyliger commented on IMPALA-4555:
-----------------------------------------
I ran into {{ReportExecStatus request on impala.ControlService from
172.26.26.40:59030 dropped due to backpressure. The service queue is full; it
has 2147483647 items.}} recently, which this would have helped with.
If we're saying that {{ReportExecStatus(done=false)}} is less important than
{{ReportExecStatus(done=true)}}, should we give them separate queues on the
coordinator side?
> Don't cancel query for failed ReportExecStatus (done=false) RPC
> ---------------------------------------------------------------
>
> Key: IMPALA-4555
> URL: https://issues.apache.org/jira/browse/IMPALA-4555
> Project: IMPALA
> Issue Type: Sub-task
> Components: Distributed Exec
> Affects Versions: Impala 2.7.0
> Reporter: Sailesh Mukil
> Assignee: Thomas Tauber-Marshall
> Priority: Major
>
> We currently try to send the ReportExecStatus RPC up to 3 times if the first
> 2 times are unsuccessful - due to high network load or a network partition.
> If all 3 attempts fail, we cancel the fragment instance and hence the query.
> However, we do not need to cancel the fragment instance if sending the report
> with _done=false_ failed. We can just skip this turn and try again the next
> time.
> We could probably skip sending the report up to 2 times (if we're unable to
> send due to high network load and if done=false) before succumbing to the
> current behavior, which is to cancel the fragment instance. The point is to
> try at a later time when the network load may be lower rather than try
> quickly again. The chance that the network load would reduce in 100 ms is
> less than in 5s.
> Also, we probably do not need to have the retry logic unless we've already
> skipped twice or if done=true.
> This could help reduce the network load on the coordinator for highly
> concurrent workloads.
> The only drawback I see now is that the QueryExecSummary might be stale for a
> while (which it would have anyway because the RPCs would have failed to send)
> P.S: This above proposed solution may need to change if we go ahead with
> IMPALA-2990.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]