IMPALA-3633: cancel fragment if coordinator is gone

The bug is that return_val.status is an optional field, so setting the
status without __isset is equivalent to Status::OK(). This meant that a
fragment reporting status was not notified when the coordinator had gone
away. As a result, if a cancel RPC was lost, we could be left with
zombie fragments with no coordinator that kept running until completion.
Testing: I couldn't see a way to replicate this reliably with our
existing test setup, since it requires some RPCs to be dropped to get
into this state. I manually tested by commenting out
CancelRemoteFragments(), starting a long-running query, then cancelling
it. Before the patch, perf top showed that the fragments continued to
execute the query. After the patch, the fragments stopped executing
quickly.

Change-Id: I62ab6f4df7c0ee60c6aa6291513f9f0cbfac3fe7
Reviewed-on: http://gerrit.cloudera.org:8080/3238
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Internal Jenkins

Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/4edb8bb6
Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/4edb8bb6
Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/4edb8bb6

Branch: refs/heads/master
Commit: 4edb8bb60de5264848b3f6934edce7df11a41a11
Parents: c76750a
Author: Tim Armstrong <[email protected]>
Authored: Fri May 27 09:07:26 2016 -0700
Committer: Tim Armstrong <[email protected]>
Committed: Tue May 31 23:32:12 2016 -0700

----------------------------------------------------------------------
 be/src/service/impala-server.cc | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/4edb8bb6/be/src/service/impala-server.cc
----------------------------------------------------------------------
diff --git a/be/src/service/impala-server.cc b/be/src/service/impala-server.cc
index 53e7dde..8f62171 100644
--- a/be/src/service/impala-server.cc
+++ b/be/src/service/impala-server.cc
@@ -1095,19 +1095,16 @@ void ImpalaServer::ReportExecStatus(
       // every report (assign each query a local int32_t id and use that to index into a
       // vector of QueryExecStates, w/o lookup or locking?)
   shared_ptr<QueryExecState> exec_state = GetQueryExecState(params.query_id, false);
-  // TODO: This is expected occasionally (since a report RPC might be in flight while
-  // cancellation is happening), but repeated instances for the same query are a bug
-  // (which we have occasionally seen). Consider keeping query exec states around for a
-  // little longer (until all reports have been received).
   if (exec_state.get() == NULL) {
-    return_val.status.__set_status_code(TErrorCode::INTERNAL_ERROR);
+    // This is expected occasionally (since a report RPC might be in flight while
+    // cancellation is happening). Return an error to the caller to get it to stop.
     const string& err = Substitute("ReportExecStatus(): Received report for unknown "
         "query ID (probably closed or cancelled). (query_id: $0, backend: $1, instance:"
         " $2 done: $3)", PrintId(params.query_id), params.instance_state_idx,
         PrintId(params.fragment_instance_id), params.done);
-    return_val.status.error_msgs.push_back(err);
-    // TODO: Re-enable logging when this only happens once per fragment.
+    Status(TErrorCode::INTERNAL_ERROR, err).SetTStatus(&return_val);
+    VLOG_QUERY << err;
     return;
   }
   exec_state->coord()->UpdateFragmentExecStatus(params).SetTStatus(&return_val);
