[ 
https://issues.apache.org/jira/browse/IMPALA-4555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763159#comment-16763159
 ] 

ASF subversion and git services commented on IMPALA-4555:
---------------------------------------------------------

Commit b1e4957ba78ef496d21728606889d1eb83ef6b27 in impala's branch 
refs/heads/master from Thomas Tauber-Marshall
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=b1e4957 ]

IMPALA-4555: Make QueryState's status reporting more robust

QueryState periodically collects runtime profiles from all of its
fragment instances and sends them to the coordinator. Previously, each
time this happens, if the rpc fails, QueryState will retry twice after
a configurable timeout and then cancel the fragment instances under
the assumption that the coordinator no longer exists.

We've found in real clusters that this logic is too sensitive to
failed rpcs and can result in fragment instances being cancelled even
in cases where the coordinator is still running.

This patch makes a few improvements to this logic:
- When a report fails to send, instead of retrying the same report
  quickly (after waiting report_status_retry_interval_ms), we wait the
  regular reporting interval (status_report_interval_ms), regenerate
  any stale portions of the report, and then retry.
- A new flag, --status_report_max_retries, is introduced, which
  controls the number of failed reports that are allowed before the
  query is cancelled. --report_status_retry_interval_ms is removed.
- Backoff is used for repeated failed attempts, such that for a period
  between retries of 't', on try 'n' the actual timeout will be t * n.

Testing:
- Added a test which results in a large number of failed intermediate
  status reports but still succeeds.

Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Reviewed-on: http://gerrit.cloudera.org:8080/12049
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Don't cancel query for failed ReportExecStatus (done=false) RPC
> ---------------------------------------------------------------
>
>                 Key: IMPALA-4555
>                 URL: https://issues.apache.org/jira/browse/IMPALA-4555
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Distributed Exec
>    Affects Versions: Impala 2.7.0
>            Reporter: Sailesh Mukil
>            Assignee: Thomas Tauber-Marshall
>            Priority: Major
>
> We currently try to send the ReportExecStatus RPC up to 3 times if the first 
> 2 times are unsuccessful - due to high network load or a network partition. 
> If all 3 attempts fail, we cancel the fragment instance and hence the query.
> However, we do not need to cancel the fragment instance if sending the report 
> with _done=false_ failed. We can just skip this turn and try again the next 
> time.
> We could probably skip sending the report up to 2 times (if we're unable to 
> send due to high network load and if done=false) before succumbing to the 
> current behavior, which is to cancel the fragment instance. The point is to 
> try at a later time when the network load may be lower rather than try 
> quickly again. The chance that the network load would reduce in 100 ms is 
> less than in 5s.
> Also, we probably do not need to have the retry logic unless we've already 
> skipped twice or if done=true.
> This could help reduce the network load on the coordinator for highly 
> concurrent workloads.
> The only drawback I see now is that the QueryExecSummary might be stale for a 
> while (which it would have anyway because the RPCs would have failed to send)
> P.S: This above proposed solution may need to change if we go ahead with 
> IMPALA-2990.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to