[ https://issues.apache.org/jira/browse/DRILL-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15324398#comment-15324398 ]

John Omernik commented on DRILL-4286:
-------------------------------------

Paul and I had an offline discussion on this as well, so I will repeat some 
things I mentioned to Paul before. 

I like the idea of a state. In my post to Paul, I added a znode and created a 
"desired" state value. I'll explain the reason for this below; however, I will 
say I am not a ZooKeeper expert, so having a znode that drillbits watch was one 
of those things that sounded good on the surface while leaving me worried about 
performance on, say, a 1000-node cluster.  To support my idea, and before I 
explain it, we could add a "desired_state.poll.interval.seconds" configuration 
variable, which would be the interval at which a drillbit polls its znode to 
determine the desired state.  The first poll would start at 
random(int(0 - desired_state.poll.interval.seconds)) (that's not any real 
language, just a way to say the first poll happens a random number of seconds 
between 0 and the poll interval, so the requests are staggered). 
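To make the staggering concrete, here's a rough sketch of what I have in mind; 
the property name and the use of a scheduled executor are just my assumptions 
for illustration, nothing that exists in Drill today: 

{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class DesiredStatePoller {

  // Hypothetical value read from "desired_state.poll.interval.seconds".
  private final long pollIntervalSeconds;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public DesiredStatePoller(long pollIntervalSeconds) {
    this.pollIntervalSeconds = pollIntervalSeconds;
  }

  public void start(Runnable checkDesiredState) {
    // First poll fires after a random delay in [0, interval) so a large
    // cluster doesn't hit ZooKeeper all at once; later polls run at the
    // configured interval.
    long initialDelay = ThreadLocalRandom.current().nextLong(pollIntervalSeconds);
    scheduler.scheduleAtFixedRate(
        checkDesiredState, initialDelay, pollIntervalSeconds, TimeUnit.SECONDS);
  }
}
{code}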

Ok znode1: state

As Paul said, "not set", "START", "RUN", "DRAIN", "STOP".  My initial 
suggestion did not have a "Not Set" I.e. when the drill bit registered 
initially, it always registered with "start" and only changed to "run" when 
everything was healthy. Also, I didn't have "STOP", instead I had "DRAINING" 
(in addition to DRAINED) I think Paul's DRAIN maybe my "DRAINING" and Paul's 
"STOP" may be my "DRAINED" If that is so, then I think we should discuss this. 
A Drill Bit that is drained is not "Stopped" It's still running, and I want to 
be clear it's state.  What I am doing is have an idea that a Bit can be 
running, but not accepting queries, and not in a "shutting down" mode.  This 
may assist in future use cases with Troubleshooting, or other administration 
tasks.  Also, the state of "DRAINING" is different from that of "DRAINED" in 
how the administrator looks at things. 
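To make the distinction concrete, here is how I picture the two value sets; the 
names are just my proposal, nothing that exists in Drill today: 

{code}
/** Observed state a drillbit reports about itself (znode1 below). */
enum BitState {
  START,     // registered, still initializing
  RUN,       // healthy and accepting query fragments
  DRAINING,  // finishing in-flight work, accepting nothing new
  DRAINED,   // no work left; still running, still not accepting queries
  UNKNOWN    // set by a foreman on error (see the heartbeat note below)
}

/** State an administrator asks the drillbit to reach (znode2 below). */
enum DesiredState {
  RUN,
  DRAINED
}
{code}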

znode2: desired_state

Like I said in my first paragraph, I am a bit worried my lack of understanding 
of ZooKeeper may preclude this; however, I think there are some advantages 
here.  As I wrote to Paul, it's nice that we have the SIGTERM methodology built 
in, but that's a coarse tool. First, it assumes the only "desired" state is a 
shutdown that drains queries. It's also a "bit only" feature: as Paul said, it 
doesn't stop other nodes from trying to include that node in a query. So what 
does that mean from a failure perspective? If a different node acting as 
foreman plans a query including that node right now, does the shutting-down 
node know more work is coming, or could there be a race condition where the 
shutting-down node believes it is done, exits, and then the other foreman sends 
work to a dead node, i.e. a failed query?  More so, I don't like SIGTERM as the 
initiator because we need to let the cluster know about that drillbit's state 
as well. Edge case: we have a node in a bad state, we send SIGTERM to it, and 
it ignores it for whatever reason; will other foremen still assign work?  Could 
we get into wonky cluster states because of that?  In addition, Paul's idea of 
a REST option for remote shutdown assumes the node is in a good state: it has 
to be the thing accepting the control request and starting the drain. So if you 
sent a REST command to drain, and that node was in a halfway state, or a state 
where it didn't follow through on the request, other bits may still send work 
to that node, especially if for whatever reason that hinky drillbit couldn't 
update its state. 

So, my solution is to use a "desired_state" (please also see the "heartbeat" 
note below). 

We aim to deprecate the SIGTERM methodology. This is a cluster of computers; 
sending remote SIGTERMs is not something I think makes sense at scale. Instead, 
the WebUI and the REST API expose, as Paul stated, the state, plus my 
additional desired state.  Any admin user can update the desired state of any 
node in the cluster.  This is done through a simple API call (with a check of 
permissions).  Nodes start with a default desired state of "RUN", although, as 
I mentioned to Paul, I think we could add an option such as 
"drillbit.default.desired_state" which by default is set to RUN. This way an 
administrator, if they have reasons, could start drillbits in, say, a "drained" 
state: if during "START" the desired_state is "DRAINED", the bit would move to 
that state rather than "RUN". 
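As a sketch of how that default could be read at startup (the property name 
"drillbit.default.desired_state" is hypothetical, and I'm just using the plain 
Typesafe Config API here, not any actual Drill class): 

{code}
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class DesiredStateDefaults {

  // Hypothetical property; it does not exist in Drill's configuration today.
  private static final String KEY = "drillbit.default.desired_state";

  /** Desired state a drillbit registers with when it starts. */
  public static DesiredState initialDesiredState() {
    Config config = ConfigFactory.load();
    if (config.hasPath(KEY)) {
      return DesiredState.valueOf(config.getString(KEY).toUpperCase());
    }
    return DesiredState.RUN;  // default: come up ready to accept queries
  }
}
{code}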

A healthy drillbit will poll its desired state at the poll interval above and 
will always try to achieve that desired state.  Thus if its state is RUN and it 
sees the desired_state change to "DRAINED" on the next poll, it will change its 
state to DRAINING until queries are done. Other nodes, when scheduling queries, 
could read the current state of all bits and, if a bit is NOT "RUN", leave it 
out of the planning.  This avoids the potential race condition that exists in 
the current SIGTERM method. 
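A minimal sketch of what one poll cycle on the drillbit might look like; the 
znode read/write helpers and the in-flight fragment count are placeholders for 
whatever the bit actually tracks: 

{code}
/** One poll cycle: read the desired state and converge toward it. */
void reconcile() {
  DesiredState desired = readDesiredStateZnode();  // placeholder znode read
  BitState current = readStateZnode();             // placeholder znode read

  if (desired == DesiredState.DRAINED && current == BitState.RUN) {
    writeStateZnode(BitState.DRAINING);            // stop accepting new fragments
  } else if (current == BitState.DRAINING && runningFragmentCount() == 0) {
    writeStateZnode(BitState.DRAINED);             // all in-flight work finished
  } else if (desired == DesiredState.RUN && current == BitState.DRAINED) {
    writeStateZnode(BitState.RUN);                 // admin brought the bit back
  }
}
{code}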

So, with the "heartbeat" I have mentioned above, I've seen some posts 
mentioning a heartbeat mechanism, however, I am in the dark on how it can work. 
 A new foreman, when submitting a queries the two znodes (state and desired 
state) if either of them is not "RUN", then it wouldn't include the bit in the 
query.  If however, both are RUN, and the foreman goes to schedule, and 
something errors out on that nodes work, or if some heartbeat check fails on 
work submission, the foreman could set the "state" to "Error/Unknown"  This 
would help other queries quickly ignore this bit for future queries. Now, the 
conditions that could put a node state into "Error/Unknown" would have be well 
monitored, to ensure we don't have nodes dropping for the wrong reasons, but 
this could help the overall stability of the cluster in that new work would not 
be sent to this bit of unknown state.   In addition, once a node is in this 
state, only it can change that state.  The state should only be changed by the 
node itself, unless that state change is based on an error/unknown condition.  
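A sketch of the planning-time check, assuming the foreman has already read both 
znodes for each registered bit (the maps stand in for those reads): 

{code}
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ForemanFilter {

  /**
   * A bit is only eligible for fragment assignment when BOTH its reported
   * state and its desired state read RUN; anything else (DRAINING, DRAINED,
   * UNKNOWN, or a pending drain request) is left out of the plan.
   */
  public static List<String> schedulableBits(List<String> registeredBits,
                                             Map<String, BitState> stateByBit,
                                             Map<String, DesiredState> desiredByBit) {
    return registeredBits.stream()
        .filter(bit -> stateByBit.get(bit) == BitState.RUN
                    && desiredByBit.get(bit) == DesiredState.RUN)
        .collect(Collectors.toList());
  }
}
{code}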

Overall I think this approach would provide stability and flexibility when you 
have weird hardware issues, memory issues, etc. across a cluster. It would 
allow admins to easily select nodes for draining by hand, or to move them out 
of operation for testing, log gathering, stack traces, etc.  In addition, the 
change of state is a cluster-wide operation, both in how the node learns about 
its desired state change AND in how the other nodes learn about cluster state 
changes. 

This approach would also not require any changes to YARN to work.  SIGTERM 
could still be supported for healthy nodes, but the logic would change to start 
the draining process via a znode update, and then, when the state changes from 
DRAINING to DRAINED, the SIGTERM handler would exit the process. Basically, 
this replicates what happens now, while using the framework (and keeping other 
nodes from sending work to the draining node). 
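For instance, the existing signal handling could boil down to something like 
this (the znode helpers are placeholders; this is only a sketch of the order of 
operations, not Drill's actual shutdown code): 

{code}
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
  // SIGTERM now just requests a drain; foremen stop planning to this bit
  // as soon as they see the desired_state change.
  setDesiredStateZnode(DesiredState.DRAINED);        // placeholder znode write
  while (currentState() != BitState.DRAINED) {       // placeholder znode read
    try {
      Thread.sleep(1000);                            // wait for in-flight work
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      break;
    }
  }
  // The JVM exits once the hook returns, replicating today's behaviour.
}));
{code}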

I would be very interested in discussion on this. It is a challenge for other 
SQL-on-Hadoop tools, and it really is a needed feature for a high-availability 
cluster that still has the ability to be administered, patched, etc. 



> Have an ability to put server in quiescent mode of operation
> ------------------------------------------------------------
>
>                 Key: DRILL-4286
>                 URL: https://issues.apache.org/jira/browse/DRILL-4286
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Execution - Flow
>            Reporter: Victoria Markman
>
> I think drill will benefit from a mode of operation that is called 
> "quiescent" in some databases. 
> From IBM Informix server documentation:
> {code}
> Change gracefully from online to quiescent mode
> Take the database server gracefully from online mode to quiescent mode to 
> restrict access to the database server without interrupting current 
> processing. After you perform this task, the database server sets a flag that 
> prevents new sessions from gaining access to the database server. The current 
> sessions are allowed to finish processing. After you initiate the mode 
> change, it cannot be canceled. During the mode change from online to 
> quiescent, the database server is considered to be in Shutdown mode.
> {code}
> This is different from shutdown, when processes are terminated. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
