[Impala-ASF-CR] IMPALA-2550: Switch to per-query exec rpc

Dan Hecht (Code Review) Sat, 06 May 2017 14:08:43 -0700

Dan Hecht has posted comments on this change.

Change subject: IMPALA-2550: Switch to per-query exec rpc
......................................................................



Patch Set 10:

(20 comments)

http://gerrit.cloudera.org:8080/#/c/6535/10/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

PS10, Line 204: GetPrepareStatus
> the barrier isn't really present, because getfinstancestate already waits f
Still, it's not a pattern I think we should encourage in the code base 
(DCHECK(x) where x has side-effects).  How about at least making the 
non-barrier explicit by shortcircuiting it like this:

DCHECK(coord_instance_->IsPrepared() && 
coord_instance_->GetPrepareStatus().ok());


PS10, Line 413:  backend_state->GetStatus();
> good catch, this could be cancelled at this point.
In your current code, will the real status eventually be propagated as the 
query status? Or will this race cause the query to result in CANCELLED even 
though there was a real error?


PS10, Line 424: DCHECK(query_status_.ok()); // nobody should have been able to 
cancel
> when does the query get displayed on the debug page? you still haven't retu
It gets displayed as soon as there is a client request state registered, I 
believe. (i.e. RegisterQuery()). Which is what I meant - webserver cancellation 
is definitely available before exec() returns.


PS10, Line 964: #if 0
              : do we really want this?
> remove it unless someone says we want it.
fine with me to delete.


http://gerrit.cloudera.org:8080/#/c/6535/11/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

PS11, Line 200: // fragment instance's executor will not comple
DCHECK(coord_instance_->IsPrepared() && 
coord_instance_->GetPrepareStatus().ok());

to make it explicit that this has no side-effects.


http://gerrit.cloudera.org:8080/#/c/6535/10/be/src/runtime/query-exec-mgr.cc
File be/src/runtime/query-exec-mgr.cc:

PS10, Line 51: Prepare
> i think we should do some initial setup here so we can return an error stat
the renaming helped.


PS10, Line 59: exec-fragment-instance-$0
> i left everything as is, because i didn't want to break existing tests. hap
Which tests? CAn't they be fixed? You renamed the other thread - I think it 
would be best to do this one as well. The current name is really confusing 
because it's not plural.


http://gerrit.cloudera.org:8080/#/c/6535/11/be/src/runtime/query-exec-mgr.cc
File be/src/runtime/query-exec-mgr.cc:

PS11, Line 57: n)
QueryState reference


PS11, Line 59: exec-fragment-instance
start-query-finstances


http://gerrit.cloudera.org:8080/#/c/6535/10/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

PS10, Line 195:     // TODO: this might flood the log
              :     LOG(WARNING) << "Couldn't get a client for " << 
query_ctx().coord_address
              :         <<"\tReason: " << coord_status.GetDetail();
> address what, flooding the log?
Yes, flooding the log. i.e. the TODO would be better if it says what should 
happen in the future.

And before it looks like we'd remember the status via UpdateStatus() and then 
potentially report it back the next time we try (for periodic reports). Not 
saying that's the right solution but it does look like a subtle behavior change 
so want to think it through. I think the new behavior is probably not really 
worse though (for periodic reports, just report and try again later; for final 
report, in both old and new we'd drop it on the floor).


PS10, Line 255:     // TODO: should we try to keep rpc_status for the final 
report? (but the final
              :     // report, following this Cancel(), may not succeed anyway.)
> if the report rpc fails we fail the query on this node and then do nothing/
Okay. I think the old code used to remember this status (via UpdateStatus()) 
and then would report it back if it could talk to the coordinator later when 
the fragment is complete.

But I agree the protocol needs to be fixed.


Line 319:       prepare_status = instance_status;
> fis.h points out that all public non-getter functions block until the prepa
That fis.h comment was just added.

But I don't see how it's relevant to my initial comment here -- how can 
prepare_status be CANCELLED?


http://gerrit.cloudera.org:8080/#/c/6535/11/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

PS11, Line 211: arams.instance_exec_status.emplace_back();
              :     params.__isset.instance_exec_status = true;
but why doesn't the coordinator initiate cancellation if any FIS reports error? 
 Is this to simplify the coordinator logic? I guess I'm okay with how you have 
this currently, but it just seems clearer to me to keep FIS status in only one 
place.


PS11, Line 264: 
despite


http://gerrit.cloudera.org:8080/#/c/6535/10/be/src/runtime/query-state.h
File be/src/runtime/query-state.h:

PS10, Line 59: cancelled
> vague in what way?
Vague because it's not clear what all the side effects of "cancel" are, and 
when they take place. i.e. does it have an impact on resources, and if so, 
when? 

Really, just thinking ahead a bit, so let's leave this alone for now. It will 
probably get clarified with the release resources changes.


PS10, Line 112: Blocks until all fragment instances have finished
              :   /// their Prepare phase.
> it's the externally observable behavior, and qem::startqueryhelper takes ad
Okay.


PS10, Line 132: query is complete
> i agree the guarantees around resource release and lifetime need to be comp
Agree we'll revisit this in a later change.


http://gerrit.cloudera.org:8080/#/c/6535/11/be/src/runtime/query-state.h
File be/src/runtime/query-state.h:

PS11, Line 250: 
that's no longer defined; delete.


PS11, Line 260: 
This is subtle as to why we have it so I think it would be helpful to explain 
in what cases you shouldn't try to cancel - when FIS haven't started yet. Or 
even better would be to rename this to 'fis_started' since that's what the 
caller really needs to care about.


http://gerrit.cloudera.org:8080/#/c/6535/11/be/src/runtime/test-env.cc
File be/src/runtime/test-env.cc:

Line 83:   exec_env_.reset();
what's this about?


-- 
To view, visit http://gerrit.cloudera.org:8080/6535
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I20769e420711737b6b385c744cef4851cee3facd
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Henry Robinson <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2550: Switch to per-query exec rpc

Reply via email to