[ 
https://issues.apache.org/jira/browse/DRILL-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325885#comment-15325885
 ] 

John Omernik commented on DRILL-4286:
-------------------------------------

Ahh... Great point. Using seconds just to make things easy for others to 
follow, here is the condition I missed:
Tx is the time in seconds
Q1 is our query
N5 is our node "to be drained"

T0 - The Foreman initiates planning for a query. The plan includes all 10 nodes 
in the cluster, so N5 is included in this plan.
    N5 Status: RUN (Accepting work)
    Q1 Status: Planning 

T1 - Drain of N5 is requested and started
    N5 Status: DRAINING (Accepting work)
    Q1 Status: Planning

T2 - Drain of N5 completes and N5 is now Drained/Offline
    N5 Status: Drained (Not accepting work)
    Q1 Status: Planning

T3 - Q1 Planning completes and work is sent
    N5 Status: Drained (Not accepting work)
    Q1 Status: Error, because it tried to send work to a node in an offline state 
    User Status: Livid that their query failed
    Admin Status: Exasperated because a user is mad
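To make the race concrete, here is a minimal Python sketch that replays the timeline above. It assumes a deliberately naive model in which a draining node goes offline as soon as it has no running work and then refuses any fragment; `Node`, `NodeState`, `accept_fragment`, and `replay_timeline` are hypothetical names for illustration, not Drill classes or APIs.

```python
from enum import Enum


class NodeState(Enum):
    RUN = "RUN"            # accepting work
    DRAINING = "DRAINING"  # still accepting work
    DRAINED = "DRAINED"    # offline, refuses work


class Node:
    def __init__(self, name):
        self.name = name
        self.state = NodeState.RUN

    def accept_fragment(self, query_id):
        # In this naive model a drained node simply refuses new fragments.
        if self.state is NodeState.DRAINED:
            raise RuntimeError(f"{query_id}: {self.name} is offline")
        return f"{query_id} running on {self.name}"


def replay_timeline():
    """Replay T0-T3 and return the errors the foreman would see for Q1."""
    nodes = {f"N{i}": Node(f"N{i}") for i in range(1, 11)}
    plan = list(nodes)                      # T0: Q1's plan covers all 10 nodes, N5 included
    nodes["N5"].state = NodeState.DRAINING  # T1: drain of N5 requested
    nodes["N5"].state = NodeState.DRAINED   # T2: N5 has no running work, so the drain completes
    errors = []
    for name in plan:                       # T3: planning completes, fragments are sent
        try:
            nodes[name].accept_fragment("Q1")
        except RuntimeError as exc:
            errors.append(str(exc))
    return errors


print(replay_timeline())  # ['Q1: N5 is offline']
```

The only thing N5 knew at T2 was that it had no running work, which is why "no running work" alone is not a safe signal to go offline.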


I missed this case, and I am glad you spotted it, thanks!

So, looking at your options to address this, my gut reaction to the "stating 
intentions" option is that it seems fairly complex and could open the door to 
issues. I don't want to remove it from consideration, but I have a personal 
preference for simple solutions when possible.

This leads to the grace period option. Basically, we could add a "no work 
grace period" option: when a node is draining, it keeps accepting work for 
this grace period even after all current queries are complete. I think this 
option could be a good way to approach things, but I'd lay out some caveats, or 
"additional" work to catch edge cases.  

* The first question I'd pose to the community: are there metrics for 
planning to execution, i.e. what are acceptable delays here? If I plan 
something, does anyone have metrics on the average time from plan to work, or 
from plan to max(time for a fragment to be submitted)? On a VERY large 
query/cluster, could this be extremely high? If we set the grace period to, 
say, 120 seconds, how common would it be for a plan made at T0 to have NO work 
submitted within those 120 seconds? Metrics and user stories here would be good. 

* The grace period would be an option in Drill, and busy clusters could set it 
higher.  

* I think this grace period timer should ONLY start when all work on a Drillbit 
is complete. Thus any work that arrives during the grace period would reset the 
timer to the configured grace period, and the countdown would start again only 
once all work is complete again. Remember, only queries that were planned while 
the node was in a RUN state should actually be submitting work. 

* Thinking about where the time gaps could be (and please correct me if I am 
wrong): planning should almost always complete within a "grace period" time (a 
long-running planning operation is something I think is avoided in most 
cases); however, the work on a large query may not. I.e. the query could be 
planned in that time, but a specific fragment may not be submitted to the node 
until another fragment (say, a slow one) is complete. That could lead to an 
issue. 

* So, thinking about the previous point: once planning is complete, the foreman 
knows all the work that is "yet" to complete, so could a message be developed 
that lets the foreman ping nodes to reset the timer? This ping interval would 
also be an option for tuning on large or small clusters. Thus, a small message 
that simply states "You have work coming on this currently running query" could 
reset the grace period timer. This could even be handled like an "empty 
fragment", i.e. from the Drillbit's perspective it's "work" that requires no 
effort but, per a previous bullet, resets the timer. This would require less of 
a "reservation" or stated intention that has to be recorded and handled by the 
node (consider the situation where a foreman tells a node "You'll have work 
coming, don't shut down", but the foreman/query dies and never cleans up that 
intention: will that node ever go offline?); it just prolongs the grace period 
timer. Basically, in this method the draining-to-drained countdown timer can be 
reset: accepted work resets it, and a simple message from a foreman resets it. 
But in the end, if no "foreman ping" or "accepted work" comes in, the node WILL 
pull itself out of operation, which also cleans up after potential error 
situations. 
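The grace period, the reset rule, and the foreman ping can be combined into one small sketch. This is a minimal Python illustration under assumed semantics: the countdown starts when the drain starts, cannot expire while fragments are running, restarts from the full grace period whenever the node becomes idle again, and a foreman ping acts like an empty fragment that pushes the deadline out by one grace period. `GraceTimer`, `on_foreman_ping`, and `simulate` are hypothetical names, and the 120 s grace period and 60 s ping interval are example values, not measured recommendations.

```python
class GraceTimer:
    """Hypothetical drain countdown that accepted work and foreman pings both reset."""

    def __init__(self, grace_period_s, now):
        self.grace_period_s = grace_period_s
        self.active = 0                         # fragments currently running
        self.deadline = now + grace_period_s    # countdown starts when the drain starts

    def on_fragment_start(self):
        self.active += 1                        # the node cannot drain while this runs

    def on_fragment_end(self, now):
        self.active -= 1
        if self.active == 0:
            self.deadline = now + self.grace_period_s  # idle again: full grace period

    def on_foreman_ping(self, now):
        # An "empty fragment": no work to do, but it pushes the deadline out.
        self.deadline = max(self.deadline, now + self.grace_period_s)

    def drained(self, now):
        return self.active == 0 and now >= self.deadline


def simulate(grace_s=120, ping_every_s=60, foreman_dies_at_s=300):
    """Return the second at which the node goes offline if the foreman stops pinging."""
    timer = GraceTimer(grace_s, now=0)
    t = 0
    while not timer.drained(t):
        t += 1
        if t <= foreman_dies_at_s and t % ping_every_s == 0:
            timer.on_foreman_ping(t)
    return t


print(simulate())                      # 420
print(simulate(foreman_dies_at_s=0))   # 120
```

With these example values, a foreman that pings until t=300 and then dies keeps the node draining until t=420 (last ping plus one grace period); with no pings at all, the node drains at t=120. Nothing has to be cleaned up after a dead foreman, because its pings simply stop extending the deadline.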

When my gut thinks about stating intentions, my worry is about cleanup and the 
other tracking processes needed on the node. I like the grace period option, but 
it needs some way to be extended to catch edge cases. As you state, Paul, I 
think it can be done with a bit of thought.
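The cleanup worry can be shown in a few lines. This is a minimal Python sketch of one possible "stating intentions" design, assumed here only for illustration (the option's actual design is still open): the node stays online while any intention is recorded, and only the foreman ever removes an entry. `IntentionTrackingNode` and its methods are hypothetical names.

```python
class IntentionTrackingNode:
    """Hypothetical node that stays online while any foreman has stated an intention."""

    def __init__(self):
        self.draining = False
        self.active = 0            # fragments currently running
        self.intentions = set()    # query IDs whose foreman said "work is coming"

    def register_intention(self, query_id):
        self.intentions.add(query_id)

    def release_intention(self, query_id):
        # Only the foreman calls this; nothing else ever removes an entry.
        self.intentions.discard(query_id)

    def can_go_offline(self):
        return self.draining and self.active == 0 and not self.intentions


node = IntentionTrackingNode()
node.draining = True
node.register_intention("Q1")   # the foreman for Q1 states its intention...
# ...and then dies before calling release_intention("Q1").
print(node.can_go_offline())    # False, and it stays False indefinitely
```

Unless some separate mechanism expires stale entries, this node never leaves the draining state, which is the extra tracking and cleanup a self-expiring grace period avoids.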

This is coming together. While this may not be the "neatest" of things to 
implement in Drill (I could squash bugs, add features, or improve 
performance... all sexier things to work on), I really believe the importance 
of this feature cannot be overstated, whether for a standalone Drill cluster or 
a Mesos or YARN deployment. This type of feature is critical to a well-run, 
multi-tenant environment, especially environments with strong compliance 
policies for patching, maintenance, etc., because it lets Drill integrate with 
the standard/automated systems enterprises already have in place. 
This flexibility could be a feature that draws enterprises to Drill.  


> Have an ability to put server in quiescent mode of operation
> ------------------------------------------------------------
>
>                 Key: DRILL-4286
>                 URL: https://issues.apache.org/jira/browse/DRILL-4286
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Execution - Flow
>            Reporter: Victoria Markman
>
> I think Drill will benefit from a mode of operation that is called "quiescent" 
> in some databases. 
> From IBM Informix server documentation:
> {code}
> Change gracefully from online to quiescent mode
> Take the database server gracefully from online mode to quiescent mode to 
> restrict access to the database server without interrupting current 
> processing. After you perform this task, the database server sets a flag that 
> prevents new sessions from gaining access to the database server. The current 
> sessions are allowed to finish processing. After you initiate the mode 
> change, it cannot be canceled. During the mode change from online to 
> quiescent, the database server is considered to be in Shutdown mode.
> {code}
> This is different from shutdown, when processes are terminated. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
