[
https://issues.apache.org/jira/browse/DRILL-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15324398#comment-15324398
]
John Omernik commented on DRILL-4286:
-------------------------------------
Paul and I had an offline discussion on this as well, so I will repeat some
things I mentioned to Paul before.
I like the idea of a state. In my post to Paul, I added a znode and created
a "desired" state value. I'll explain the reason for this below; however, I
will say I am not a ZooKeeper expert, so having a znode that drillbits watch
was one of those things that sounded good on the surface, but I worried about
performance on, say, a 1000 node cluster. To support my idea, and before I
explain it, I would add a "desired_state.poll.interval.seconds" configuration
variable, which would be the interval at which a drillbit polls its znode to
determine the desired state. The first poll would fire after
random(0, desired_state.poll.interval.seconds) seconds (that's not any
particular language, just a way of saying the first poll happens at a random
offset within the poll interval, so the requests are staggered across the
cluster).
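To make the staggering concrete, here is a minimal sketch (plain Java, not
Drill code) of a poller whose first run is delayed by a random offset within
the interval; the option name above and the DesiredStatePoller class are just
illustrations:
{code}
// Sketch: stagger the first desired_state poll with a random initial delay
// so a large cluster does not hit ZooKeeper all at once; later polls run at
// the configured interval ("desired_state.poll.interval.seconds" above).
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class DesiredStatePoller {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public void start(long pollIntervalSeconds, Runnable checkDesiredState) {
    // First poll at a random offset in [0, pollIntervalSeconds), then steady.
    long initialDelay = ThreadLocalRandom.current().nextLong(pollIntervalSeconds);
    scheduler.scheduleAtFixedRate(checkDesiredState, initialDelay,
        pollIntervalSeconds, TimeUnit.SECONDS);
  }
}
{code}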
OK, znode1: state
As Paul said: "not set", "START", "RUN", "DRAIN", "STOP". My initial
suggestion did not have a "not set"; i.e. when the drillbit registered
initially, it always registered with "START" and only changed to "RUN" when
everything was healthy. Also, I didn't have "STOP"; instead I had "DRAINING"
(in addition to DRAINED). I think Paul's DRAIN may be my "DRAINING" and Paul's
"STOP" may be my "DRAINED". If that is so, then I think we should discuss it.
A drillbit that is drained is not "stopped": it's still running, and I want
its state to be clear. The idea is that a bit can be running, but not
accepting queries, without being in a "shutting down" mode. This may assist
in future use cases for troubleshooting or other administration tasks. Also,
"DRAINING" reads differently to an administrator than "DRAINED" does.
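For illustration only, the state values I have in mind could be modeled with
something like the following (not Drill's actual code; ERROR_UNKNOWN
anticipates the failure case I describe further down):
{code}
// Hypothetical set of drillbit states for znode1; names are illustrative.
public enum BitState {
  START,         // registered, not yet healthy
  RUN,           // healthy, accepting new query work
  DRAINING,      // finishing in-flight queries, not accepting new ones
  DRAINED,       // still running, but accepting no query work
  ERROR_UNKNOWN  // set by a foreman that failed to reach the bit (see below)
}
{code}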
znode2: desired_state
Like I said in my first paragraph, I am a bit worried my limited understanding
of ZooKeeper may preclude this; however, I think there are some advantages
here. As I wrote to Paul, it's nice that we have the SIGTERM methodology built
in, but it's a coarse tool. First, it assumes that the only "desired" state is
a shutdown that drains queries on the way out. It's also a "bit only" feature;
as Paul said, it doesn't stop other nodes from trying to include that node in
a query. So what does that do from a failure perspective? If a different node
acting as foreman plans a query that includes the shutting-down node right
now, does that node know more work is coming, or could there be a race
condition where the shutting-down node believes it is done and exits, and then
another foreman sends work to a dead node, i.e. a failed query? More than
that, I don't like SIGTERM as the initiator because we need to let the
cluster know of that drillbit's state as well. Edge case: we have a node in a
bad state, we send SIGTERM to it, and it ignores it for whatever reason. Will
other foremen still assign work? Could we get into wonky cluster states
because of that? In addition, with Paul's idea of a REST option for remote
shutdown, we have to assume that the node is in a good state: it has to be the
thing accepting the control call that starts the draining. Thus, if you sent a
REST command to drain, and that node was in a halfway state or a state where
it didn't follow through on the request, other bits may still send work to
that node, especially if for whatever reason that hinky drillbit couldn't
update its state.
So, my solution is to use a "desired_state" (please also see the "heartbeat"
note below).
We aim to deprecate the SIGTERM methodology. This is a cluster of computers;
sending remote SIGTERMs is not something I think makes sense at scale.
Instead, we have in the WebUI and the REST API, as Paul stated, the state, and
then my additional desired state. Any admin user can update the desired state
of any node in the cluster. This is done through a simple API call (and a
check of permissions). Nodes start with a default desired state of "RUN"
(although, as I mentioned to Paul, I think we could add an option such as
"drillbit.default.desired_state", which by default is set to RUN; this way an
administrator who has a reason to could start drillbits in, say, a "DRAINED"
state). I.e. if during "START" the desired_state is "DRAINED", the bit would
move to that state rather than "RUN".
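As a rough sketch of how that default could be applied at registration,
assuming Curator for ZooKeeper access and a Typesafe-Config-style config
object (the option name, znode paths, and class are all hypothetical):
{code}
// Sketch: seed the bit's desired_state znode at registration with the
// configured default ("RUN" unless an administrator overrides it), so a bit
// can come up directly in DRAINED. Names and paths are illustrative only.
import java.nio.charset.StandardCharsets;
import com.typesafe.config.Config;
import org.apache.curator.framework.CuratorFramework;

public class DesiredStateBootstrap {
  public static void seed(CuratorFramework zk, Config config, String drillbitId)
      throws Exception {
    String defaultDesired = config.hasPath("drillbit.default.desired_state")
        ? config.getString("drillbit.default.desired_state")
        : "RUN";
    String path = "/drill/desired_state/" + drillbitId;
    if (zk.checkExists().forPath(path) == null) {
      zk.create().creatingParentsIfNeeded()
          .forPath(path, defaultDesired.getBytes(StandardCharsets.UTF_8));
    }
  }
}
{code}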
A healthy drillbit will poll its desired state at the interval above and will
always try to achieve that desired state. Thus, if its state is RUN and it
sees the desired_state change to "DRAINED" on the next poll, it will change
its state to DRAINING until its queries are done. Other nodes, when scheduling
queries, could read the current state of all bits and exclude any bit NOT in
"RUN" from planning. This closes the potential race condition that exists in
the current SIGTERM method.
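A minimal sketch of that "always work toward the desired state" loop, again
assuming Curator; the paths, state names, and drainQueries() helper are
illustrative, not Drill internals:
{code}
// Sketch: on each poll tick, compare state to desired_state and transition
// RUN -> DRAINING -> DRAINED when an administrator asks for a drain.
import java.nio.charset.StandardCharsets;
import org.apache.curator.framework.CuratorFramework;

public class DesiredStateReconciler {
  private final CuratorFramework zk;
  private final String statePath;         // e.g. /drill/state/<bit-id>
  private final String desiredStatePath;  // e.g. /drill/desired_state/<bit-id>

  public DesiredStateReconciler(CuratorFramework zk, String statePath,
                                String desiredStatePath) {
    this.zk = zk;
    this.statePath = statePath;
    this.desiredStatePath = desiredStatePath;
  }

  /** Called on every poll tick (see the staggered poller above). */
  public void reconcile() throws Exception {
    String desired = read(desiredStatePath);
    String current = read(statePath);
    if ("DRAINED".equals(desired) && "RUN".equals(current)) {
      setState("DRAINING");   // advertise: no new work, finishing in-flight queries
      drainQueries();         // hypothetical: block until running fragments complete
      setState("DRAINED");    // still alive, but out of the planning pool
    }
  }

  private String read(String path) throws Exception {
    return new String(zk.getData().forPath(path), StandardCharsets.UTF_8);
  }

  private void setState(String state) throws Exception {
    zk.setData().forPath(statePath, state.getBytes(StandardCharsets.UTF_8));
  }

  private void drainQueries() { /* wait for in-flight fragments; omitted */ }
}
{code}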
So, with the "heartbeat" I have mentioned above, I've seen some posts
mentioning a heartbeat mechanism, however, I am in the dark on how it can work.
A new foreman, when submitting a queries the two znodes (state and desired
state) if either of them is not "RUN", then it wouldn't include the bit in the
query. If however, both are RUN, and the foreman goes to schedule, and
something errors out on that nodes work, or if some heartbeat check fails on
work submission, the foreman could set the "state" to "Error/Unknown" This
would help other queries quickly ignore this bit for future queries. Now, the
conditions that could put a node state into "Error/Unknown" would have be well
monitored, to ensure we don't have nodes dropping for the wrong reasons, but
this could help the overall stability of the cluster in that new work would not
be sent to this bit of unknown state. In addition, once a node is in this
state, only it can change that state. The state should only be changed by the
node itself, unless that state change is based on an error/unknown condition.
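Here is a hedged sketch of what the foreman-side check and the Error/Unknown
marking could look like, with the same caveats (hypothetical paths and names,
Curator assumed):
{code}
// Sketch: before planning, keep only bits whose state and desired_state are
// both RUN; if work submission to a bit fails, mark its state ERROR_UNKNOWN
// so other foremen skip it on future queries.
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.curator.framework.CuratorFramework;

public class ForemanBitFilter {
  private final CuratorFramework zk;

  public ForemanBitFilter(CuratorFramework zk) {
    this.zk = zk;
  }

  /** Return only the bits eligible for query planning. */
  public List<String> schedulableBits(List<String> allBits) throws Exception {
    List<String> eligible = new ArrayList<>();
    for (String bit : allBits) {
      String state = read("/drill/state/" + bit);
      String desired = read("/drill/desired_state/" + bit);
      if ("RUN".equals(state) && "RUN".equals(desired)) {
        eligible.add(bit);
      }
    }
    return eligible;
  }

  /** Called when fragment submission to a bit fails or its heartbeat check fails. */
  public void markUnknown(String bit) throws Exception {
    zk.setData().forPath("/drill/state/" + bit,
        "ERROR_UNKNOWN".getBytes(StandardCharsets.UTF_8));
  }

  private String read(String path) throws Exception {
    return new String(zk.getData().forPath(path), StandardCharsets.UTF_8);
  }
}
{code}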
Overall, I think this approach would provide stability and flexibility when
you have weird hardware issues, memory issues, etc. across a cluster. It would
allow admins to easily select individual nodes for draining, or move them out
of operation for testing, log gathering, stack traces, etc. In addition, the
change of state is a cluster-wide operation, both in how the node learns about
its desired state change AND in how the other nodes learn about cluster state
changes.
This approach would also not require any changes to YARN to work. SIGTERM
could still be supported for healthy nodes, with the logic changed so that it
starts the draining process via a znode update and then, when the state moves
from draining to drained, exits the process. Basically, this replicates what
happens now, while using the framework (and keeping other nodes from sending
jobs to the draining node).
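Sketched out, that SIGTERM path becomes a thin wrapper over the same znode
machinery described above; everything here is illustrative:
{code}
// Sketch: a shutdown hook that routes SIGTERM through the desired_state znode,
// waits for the reconciler (above) to report DRAINED, then lets the JVM exit.
import java.nio.charset.StandardCharsets;
import org.apache.curator.framework.CuratorFramework;

public class GracefulShutdownHook {
  public static void install(CuratorFramework zk, String statePath,
                             String desiredStatePath) {
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
      try {
        // Ask for a drain via the cluster-visible znode, not a local-only flag.
        zk.setData().forPath(desiredStatePath,
            "DRAINED".getBytes(StandardCharsets.UTF_8));
        // Wait for the bit to finish draining before the process goes away.
        while (!"DRAINED".equals(
            new String(zk.getData().forPath(statePath), StandardCharsets.UTF_8))) {
          Thread.sleep(1000);
        }
      } catch (Exception e) {
        // Best effort: if ZooKeeper is unreachable, fall back to a plain exit.
      }
    }));
  }
}
{code}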
I would be very interested in discussion on this. This is a challenge for
other SQL-on-Hadoop tools as well, and it really is a needed feature for a
high-availability cluster that can still be administered, patched, etc.
> Have an ability to put server in quiescent mode of operation
> ------------------------------------------------------------
>
> Key: DRILL-4286
> URL: https://issues.apache.org/jira/browse/DRILL-4286
> Project: Apache Drill
> Issue Type: New Feature
> Components: Execution - Flow
> Reporter: Victoria Markman
>
> I think drill will benefit from mode of operation that is called "quiescent"
> in some databases.
> From IBM Informix server documentation:
> {code}
> Change gracefully from online to quiescent mode
> Take the database server gracefully from online mode to quiescent mode to
> restrict access to the database server without interrupting current
> processing. After you perform this task, the database server sets a flag that
> prevents new sessions from gaining access to the database server. The current
> sessions are allowed to finish processing. After you initiate the mode
> change, it cannot be canceled. During the mode change from online to
> quiescent, the database server is considered to be in Shutdown mode.
> {code}
> This is different from shutdown, when processes are terminated.