[ 
https://issues.apache.org/jira/browse/IMPALA-4555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766743#comment-16766743
 ] 

ASF subversion and git services commented on IMPALA-4555:
---------------------------------------------------------

Commit 9492d451d5d5a82bfc6f4c93c3a0c6e6d0cc4981 in impala's branch 
refs/heads/master from Thomas Tauber-Marshall
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=9492d45 ]

IMPALA-8183: fix test_reportexecstatus_retry flakiness

The test is designed to cause ReportExecStatus() RPCs to fail by
backing up the control service queue. Previously, after a failed
ReportExecStatus() we would wait 'report_status_retry_interval_ms'
between retries, which was 100ms by default and wasn't touched by the
test. That 100ms was right on the edge of being enough time for the
coordinator to keep up with processing the reports, so that some would
fail but most would succeed. It was always possible that we could hit
IMPALA-2990 in this setup, but it was unlikely.

Now, with IMPALA-4555 'report_status_retry_interval_ms' was removed
and we instead wait 'status_report_interval_ms' between retries. By
default, this is 5000ms, so it should give the coordinator even more
time and make these issues less likely. However, the test sets
'status_report_interval_ms' to 10ms, which isn't nearly enough time
for the coordinator to do its processing, so many of the
ReportExecStatus() RPCs fail and we hit IMPALA-2990 fairly
often.

The solution is to set 'status_report_interval_ms' to 100ms in the
test, which achieves roughly the same retry frequency as before. The
same change is made to a similar test, test_reportexecstatus_timeout.
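
As a back-of-the-envelope check of the intervals mentioned above, the
sketch below converts each retry interval into retries per second for a
single sender. The helper name retries_per_second is hypothetical and
exists only for this illustration; it is not part of Impala, and the
model ignores RPC latency and the number of fragment instances.

```python
def retries_per_second(interval_ms: float) -> float:
    """Upper bound on retries per second for one sender that waits
    interval_ms between attempts (RPC latency itself is ignored)."""
    return 1000.0 / interval_ms


# Intervals quoted in the commit message, in milliseconds.
intervals_ms = {
    "old report_status_retry_interval_ms default": 100,
    "status_report_interval_ms default": 5000,
    "status_report_interval_ms previously set by the test": 10,
    "status_report_interval_ms now set by the test": 100,
}

for label, interval in intervals_ms.items():
    print(f"{label}: {retries_per_second(interval):g} retries/s")
```

Under this simplified model, the 10ms setting retries ten times as often
as the old 100ms interval, and the new 100ms setting matches the old rate.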

Testing:
- Ran test_reportexecstatus_retry in a loop 400 times without seeing a
  failure. It previously reproduced for me about once per 50 runs.
- Manually verified that both tests are still hitting the error paths
  that they are supposed to be testing.

Change-Id: I7027a6e099c543705e5845ee0e5268f1f9a3fb05
Reviewed-on: http://gerrit.cloudera.org:8080/12461
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> Don't cancel query for failed ReportExecStatus (done=false) RPC
> ---------------------------------------------------------------
>
>                 Key: IMPALA-4555
>                 URL: https://issues.apache.org/jira/browse/IMPALA-4555
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Distributed Exec
>    Affects Versions: Impala 2.7.0
>            Reporter: Sailesh Mukil
>            Assignee: Thomas Tauber-Marshall
>            Priority: Major
>             Fix For: Impala 3.2.0
>
>
> We currently send the ReportExecStatus RPC up to 3 times, retrying when
> an attempt fails (for example, due to high network load or a network
> partition).
> If all 3 attempts fail, we cancel the fragment instance and hence the query.
> However, we do not need to cancel the fragment instance if sending the report 
> with _done=false_ failed. We can just skip this turn and try again the next 
> time.
> We could probably skip sending the report up to 2 times (if we're unable to 
> send due to high network load and if done=false) before succumbing to the 
> current behavior, which is to cancel the fragment instance. The point is to
> try at a later time when the network load may be lower rather than try
> quickly again. The network load is less likely to drop within 100 ms
> than within 5 s.
> Also, we probably do not need to have the retry logic unless we've already 
> skipped twice or if done=true.
> This could help reduce the network load on the coordinator for highly 
> concurrent workloads.
> The only drawback I see now is that the QueryExecSummary might be stale for a
> while (which it would have been anyway because the RPCs would have failed to
> send).
> P.S.: The solution proposed above may need to change if we go ahead with
> IMPALA-2990.
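
The policy proposed in the quoted description can be sketched roughly as
follows. This is an illustrative Python model, not Impala's actual C++
implementation: the names ReportState, send_report and try_rpc are
hypothetical, the limits come from the "up to 3 times" and "up to 2
times" figures in the description, and resetting the skip counter after
a successful report is an assumption.

```python
from dataclasses import dataclass
from typing import Callable

MAX_RPC_ATTEMPTS = 3     # "up to 3 times" in the description
MAX_SKIPPED_REPORTS = 2  # "skip sending the report up to 2 times"


@dataclass
class ReportState:
    skipped: int = 0        # consecutive report turns skipped so far
    cancelled: bool = False


def send_report(state: ReportState, done: bool,
                try_rpc: Callable[[], bool]) -> str:
    """Run one reporting turn; returns 'sent', 'skipped' or 'cancelled'.

    try_rpc stands in for a single ReportExecStatus attempt and returns
    True on success.
    """
    # Retry within a turn only for the final report (done=True) or once
    # the skip budget is already used up; otherwise try just once.
    must_deliver = done or state.skipped >= MAX_SKIPPED_REPORTS
    attempts = MAX_RPC_ATTEMPTS if must_deliver else 1

    for _ in range(attempts):
        if try_rpc():
            state.skipped = 0  # assumption: a success resets the budget
            return "sent"

    if not must_deliver:
        # Intermediate report failed: skip this turn, try again later.
        state.skipped += 1
        return "skipped"

    # Fall back to the existing behavior: cancel the fragment instance.
    state.cancelled = True
    return "cancelled"
```

With an RPC that always fails and done=false, this model skips two turns
and cancels on the third turn after three attempts; a failing report with
done=true is retried three times and then cancels immediately.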



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org
