Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10813 )
Change subject: IMPALA-7163: Implement a state machine for the QueryState class ...................................................................... Patch Set 2: (12 comments) Some initial high level comments on the states and the synchronization. http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h File be/src/runtime/query-state.h: http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h@66 PS2, Line 66: f any fragment instance hits an error or cancellation, then we immediately change the : /// state of the query to the corresponding error state. but won't we need to wait until all FIS have completed (even if it's because they were cancelled when the first FIS failed) before sending the final report? http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h@194 PS2, Line 194: pair<Status, TUniqueId> rather than using a pair<>, how about defining a struct so that the fields have meaningful names? http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h@245 PS2, Line 245: 2& we generally only use const references. For things that will be modified, use a pointer so it's obvious at the callsite. http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h@245 PS2, Line 245: pair<Status, TUniqueId> Use FInstanceStatus http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h@262 PS2, Line 262: to another thread that : // called this function. That doesn't seem right in the "less than 0" case, since this function only ever sets the count to 0. I think the less than case requires racing with both this function and DoneWithState(). Anyway, I think these details all go away if you can use the CountingBarrier<> idea. http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h@289 PS2, Line 289: INITIALIZED, : /// STARTED: All fragment instances have been started. Implies that the query is : /// being prepared. : STARTED, : /// PREPARED: All fragment instances managed by this QueryState have successfully : /// completed Prepare(). Implies that the query is Opening. : PREPARED, : /// OPENED: All fragment instances managed by this QueryState have successfully : /// completed Open(). Implies that the query is executing. : OPENED, why do we need all of these states? it's a bit hard to see how they will be used when the code to use them isn't here yet. could you at least explain that or reduce the number of states in this patch (and introduce them back later as needed)? http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h@300 PS2, Line 300: The QueryState is ready for tear-down at this state what about the other terminal states? when is the QueryState ready for teardown in those cases? http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h@312 PS2, Line 312: only expect a single thread to call this function. that's the defintion of not thread safe :) It be more helpful to explain what mechanism guarantees that this is only called by a single thread. i.e. this is a helper to MonitorFInstances() presumably, and up there document that that thing should be called by a single thread. http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h@337 PS2, Line 337: Is The "Is" sounds like it returns a bool, yet the return type is void. What does "validate" mean? If this means DCHECK, just call it DCheckLegalExecStateTransition() http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h@344 PS2, Line 344: AtomicInt32 num_remaining_finstances_preparing_; : AtomicInt32 num_remaining_finstances_opening_; : AtomicInt32 num_remaining_finstances_executing_; how are these going to be used? it seems like we'd want them to be CountingBarriers (that the monitoring thread will do timed blocks on). Oh, I guess we'll use the Promises to do that. But then, it seems what we really want is to have a CountingBarrier<T> where our existing CountingBarrier is CountingBarrier<bool> and this code uses CountingBarrier<FInstanceStatus>. Then, I think your DoneWithState() is just CountingBarrier::Notify(OK) and ErrorDuringState is CountingBarrier::NotifyRemaining(fis_error). http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.cc File be/src/runtime/query-state.cc: http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.cc@437 PS2, Line 437: : if (!prepare_status.first.ok()) { : // If 'prepare_status' is not OK, then one of the fragment instances has hit an error. : // Don't return until every instance is prepared and record the first non-OK : // (non-CANCELLED if available) status. : for (auto entry : fis_map_) { : Status instance_status = entry.second->WaitForPrepare(); : // don't wipe out an error in one instance with the resulting CANCELLED from : // the remaining instances : if (!instance_status.ok() && prepare_status.first.IsCancelled()) { : prepare_status.first = instance_status; : prepare_status.second = entry.first; : } : } : } why do we need this special case logic? i.e it seems like this goes against the documentation that says that error state transitions occur when the first thread hits an error. http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.cc@586 PS2, Line 586: // initiate cancellation if nobody has done so yet : if (!status.ok()) Cancel(); : // decrement refcount taken in StartFInstances() : ReleaseExecResourceRefcount(); : // decrement refcount taken in StartFInstances() : ExecEnv::GetInstance()->query_exec_mgr()->ReleaseQueryState(this); this stuff should be integrated as responses to transitions of the state machine. Seems like we shouldn't need the exec resources ref count anymore since we have ref counts for the states. -- To view, visit http://gerrit.cloudera.org:8080/10813 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iec5670a7db83ecae4656d7bb2ea372d3767ba7fe Gerrit-Change-Number: 10813 Gerrit-PatchSet: 2 Gerrit-Owner: Sailesh Mukil <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Sailesh Mukil <[email protected]> Gerrit-Comment-Date: Tue, 26 Jun 2018 17:01:37 +0000 Gerrit-HasComments: Yes
