IMPALA-3633: cancel fragment if coordinator is gone

The bug is that return_val.status is an optional field, so setting the
status code without also setting __isset is equivalent to returning
Status::OK(). This meant that a fragment reporting its status was not
notified when the coordinator had gone away. As a result, if a cancel
RPC was lost, we could be left with zombie fragments with no
coordinator that kept on running until completion.

Testing:
I couldn't see a way to replicate this reliably with our existing test
setup, since it requires some RPCs to be dropped to get into this state.
I manually tested by commenting out CancelRemoteFragments(), starting a
long-running query, then cancelling it. Before the patch, perf top showed
that the fragments continued to execute the query. After the patch, the
fragments stopped executing quickly.

Change-Id: I62ab6f4df7c0ee60c6aa6291513f9f0cbfac3fe7
Reviewed-on: http://gerrit.cloudera.org:8080/3238
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Internal Jenkins


Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/4edb8bb6
Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/4edb8bb6
Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/4edb8bb6

Branch: refs/heads/master
Commit: 4edb8bb60de5264848b3f6934edce7df11a41a11
Parents: c76750a
Author: Tim Armstrong <[email protected]>
Authored: Fri May 27 09:07:26 2016 -0700
Committer: Tim Armstrong <[email protected]>
Committed: Tue May 31 23:32:12 2016 -0700

----------------------------------------------------------------------
 be/src/service/impala-server.cc | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/4edb8bb6/be/src/service/impala-server.cc
----------------------------------------------------------------------
diff --git a/be/src/service/impala-server.cc b/be/src/service/impala-server.cc
index 53e7dde..8f62171 100644
--- a/be/src/service/impala-server.cc
+++ b/be/src/service/impala-server.cc
@@ -1095,19 +1095,16 @@ void ImpalaServer::ReportExecStatus(
   // every report (assign each query a local int32_t id and use that to index into a
   // vector of QueryExecStates, w/o lookup or locking?)
   shared_ptr<QueryExecState> exec_state = GetQueryExecState(params.query_id, false);
-  // TODO: This is expected occasionally (since a report RPC might be in flight while
-  // cancellation is happening), but repeated instances for the same query are a bug
-  // (which we have occasionally seen). Consider keeping query exec states around for a
-  // little longer (until all reports have been received).
   if (exec_state.get() == NULL) {
-    return_val.status.__set_status_code(TErrorCode::INTERNAL_ERROR);
+    // This is expected occasionally (since a report RPC might be in flight while
+    // cancellation is happening). Return an error to the caller to get it to stop.
     const string& err = Substitute("ReportExecStatus(): Received report for unknown "
         "query ID (probably closed or cancelled). (query_id: $0, backend: $1, instance:"
         " $2 done: $3)", PrintId(params.query_id),
         params.instance_state_idx, PrintId(params.fragment_instance_id),
         params.done);
-    return_val.status.error_msgs.push_back(err);
-    // TODO: Re-enable logging when this only happens once per fragment.
+    Status(TErrorCode::INTERNAL_ERROR, err).SetTStatus(&return_val);
+    VLOG_QUERY << err;
     return;
   }
   exec_state->coord()->UpdateFragmentExecStatus(params).SetTStatus(&return_val);
