Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10813 )

Change subject: IMPALA-7163: Implement a state machine for the QueryState class
......................................................................


Patch Set 2:

(12 comments)

Some initial high level comments on the states and the synchronization.

http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h
File be/src/runtime/query-state.h:

http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h@66
PS2, Line 66: f any fragment instance hits an error or cancellation, then we 
immediately change the
            : /// state of the query to the corresponding error state.
but won't we need to wait until all FIS have completed (even if it's because 
they were cancelled when the first FIS failed) before sending the final report?


http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h@194
PS2, Line 194: pair<Status, TUniqueId>
rather than using a pair<>, how about defining a struct so that the fields have 
meaningful names?


http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h@245
PS2, Line 245: 2&
we generally only use const references. For things that will be modified, use a 
pointer so it's obvious at the callsite.


http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h@245
PS2, Line 245: pair<Status, TUniqueId>
Use FInstanceStatus


http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h@262
PS2, Line 262: to another thread that
             :     // called this function.
That doesn't seem right in the "less than 0" case, since this function only 
ever sets the count to 0. I think the less than case requires racing with both 
this function and DoneWithState(). Anyway, I think these details all go away if 
you can use the CountingBarrier<> idea.


http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h@289
PS2, Line 289:     INITIALIZED,
             :     /// STARTED: All fragment instances have been started. 
Implies that the query is
             :     /// being prepared.
             :     STARTED,
             :     /// PREPARED: All fragment instances managed by this 
QueryState have successfully
             :     /// completed Prepare(). Implies that the query is Opening.
             :     PREPARED,
             :     /// OPENED: All fragment instances managed by this 
QueryState have successfully
             :     /// completed Open(). Implies that the query is executing.
             :     OPENED,
why do we need all of these states? it's a bit hard to see how they will be 
used when the code to use them isn't here yet. could you at least explain that 
or reduce the number of states in this patch (and introduce them back later as 
needed)?


http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h@300
PS2, Line 300: The QueryState is ready for tear-down at this state
what about the other terminal states? when is the QueryState ready for teardown 
in those cases?


http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h@312
PS2, Line 312: only expect a single thread to call this function.
that's the defintion of not thread safe :) It be more helpful to explain what 
mechanism guarantees that this is only called by a single thread. i.e. this is 
a helper to MonitorFInstances() presumably, and up there document that that 
thing should be called by a single thread.


http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h@337
PS2, Line 337: Is
The "Is" sounds like it returns a bool, yet the return type is void. What does 
"validate" mean? If this means DCHECK, just call it 
DCheckLegalExecStateTransition()


http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.h@344
PS2, Line 344:   AtomicInt32 num_remaining_finstances_preparing_;
             :   AtomicInt32 num_remaining_finstances_opening_;
             :   AtomicInt32 num_remaining_finstances_executing_;
how are these going to be used? it seems like we'd want them to be 
CountingBarriers (that the monitoring thread will do timed blocks on). Oh, I 
guess we'll use the Promises to do that.

But then, it seems what we really want is to have a CountingBarrier<T> where 
our existing CountingBarrier is CountingBarrier<bool> and this code uses 
CountingBarrier<FInstanceStatus>.

Then, I think your DoneWithState() is just CountingBarrier::Notify(OK) and 
ErrorDuringState is CountingBarrier::NotifyRemaining(fis_error).


http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.cc@437
PS2, Line 437:
             :   if (!prepare_status.first.ok()) {
             :     // If 'prepare_status' is not OK, then one of the fragment 
instances has hit an error.
             :     // Don't return until every instance is prepared and record 
the first non-OK
             :     // (non-CANCELLED if available) status.
             :     for (auto entry : fis_map_) {
             :       Status instance_status = entry.second->WaitForPrepare();
             :       // don't wipe out an error in one instance with the 
resulting CANCELLED from
             :       // the remaining instances
             :       if (!instance_status.ok() && 
prepare_status.first.IsCancelled()) {
             :         prepare_status.first = instance_status;
             :         prepare_status.second = entry.first;
             :       }
             :     }
             :   }
why do we need this special case logic? i.e it seems like this goes against the 
documentation that says that error state transitions occur when the first 
thread hits an error.


http://gerrit.cloudera.org:8080/#/c/10813/2/be/src/runtime/query-state.cc@586
PS2, Line 586:   // initiate cancellation if nobody has done so yet
             :   if (!status.ok()) Cancel();
             :   // decrement refcount taken in StartFInstances()
             :   ReleaseExecResourceRefcount();
             :   // decrement refcount taken in StartFInstances()
             :   
ExecEnv::GetInstance()->query_exec_mgr()->ReleaseQueryState(this);
this stuff should be integrated as responses to transitions of the state 
machine. Seems like we shouldn't need the exec resources ref count anymore 
since we have ref counts for the states.



--
To view, visit http://gerrit.cloudera.org:8080/10813
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iec5670a7db83ecae4656d7bb2ea372d3767ba7fe
Gerrit-Change-Number: 10813
Gerrit-PatchSet: 2
Gerrit-Owner: Sailesh Mukil <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Sailesh Mukil <[email protected]>
Gerrit-Comment-Date: Tue, 26 Jun 2018 17:01:37 +0000
Gerrit-HasComments: Yes

Reply via email to