[ 
https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864569#comment-15864569
 ] 

David McLaughlin commented on AURORA-1890:
------------------------------------------

You're right, the write volume is totally dependent on your update volume and 
the pulse interval. For many use cases, the cost of the update would be 
negligible. I think the real concern was the cost of reading the last pulse 
time. 

One other reason why persisting the pulse is not super useful is that the 
scheduler failover time typically exceeds a sane pulse timeout. The same applies 
to automatically setting it to the last event time (which would be preferable 
IMO). I think the reason we backed out of the grace period change (which was 
going to be achieved by setting the timestamp to the time the scheduler 
acquired leadership) is that it would potentially reactivate a bunch of updates 
that were legitimately blocked. In the end, we agreed the churn from 
ROLLING_FORWARD -> BLOCKED_AWAITING_PULSE -> ROLLING_FORWARD was harmless. But I 
suppose if you have automation on top of this that reacts to state changes, it 
could be annoying. 
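
To make the tradeoff concrete, here is a minimal sketch of the two startup 
choices. It is not the actual PulseHandler code: the class PulseStateSketch, 
its methods, and the NO_PULSE sentinel are hypothetical names used only for 
illustration. It assumes pulse state is a per-update timestamp held in memory.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of in-memory pulse tracking; not Aurora's PulseHandler. */
public class PulseStateSketch {
  /** Sentinel meaning "no pulse received yet" (an assumption of this sketch). */
  public static final long NO_PULSE = -1L;

  private final Map<String, Long> lastPulseMs = new HashMap<>();
  private final Map<String, Long> timeoutMs = new HashMap<>();

  /** Registers an update; initialPulseMs is NO_PULSE or a seeded timestamp. */
  public void register(String updateId, long pulseTimeoutMs, long initialPulseMs) {
    timeoutMs.put(updateId, pulseTimeoutMs);
    lastPulseMs.put(updateId, initialPulseMs);
  }

  /** Records a pulse for the update at the given time. */
  public void pulse(String updateId, long nowMs) {
    lastPulseMs.put(updateId, nowMs);
  }

  /** Blocked if the update never got a pulse or the last pulse has expired. */
  public boolean isBlocked(String updateId, long nowMs) {
    long last = lastPulseMs.get(updateId);
    return last == NO_PULSE || nowMs - last >= timeoutMs.get(updateId);
  }
}
```

Registering with NO_PULSE after a failover blocks every pulse-coordinated 
update until its next pulse, which is the churn described above. Registering 
with the leadership timestamp instead unblocks every update, including ones 
that were legitimately blocked before the failover.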

> Job Update Pulse History is not durably stored
> ----------------------------------------------
>
>                 Key: AURORA-1890
>                 URL: https://issues.apache.org/jira/browse/AURORA-1890
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Zameer Manji
>
> I have experienced the following problem with pulse updates. To reproduce:
> 1. Create an update with a pulse timeout of 1h
> 2. Send a pulse to get the update going.
> 3. Fail over the scheduler immediately after.
> 4. Observe that the update is awaiting another pulse right after the failover.
> This is because the {{JobUpdateControllerImpl}} stores pulse history and 
> state in memory in {{PulseHandler}}. On scheduler startup, the pulse state is 
> reset to no pulse received.
> We can solve this by durably storing the timestamp of the last pulse received 
> in storage.
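
A rough sketch of the proposed fix follows, assuming a simple key-value view 
of durable storage. DurablePulseSketch, PulseStore, and MapPulseStore are 
hypothetical names, not Aurora's storage API; the map-backed store only 
simulates state that survives a failover.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

/** Hypothetical sketch of persisting the last pulse; not Aurora's storage API. */
public class DurablePulseSketch {
  /** Stand-in for durable storage that survives a scheduler failover. */
  public interface PulseStore {
    void saveLastPulse(String updateId, long timestampMs);
    OptionalLong loadLastPulse(String updateId);
  }

  /** Map-backed store used only to simulate durability in this sketch. */
  public static class MapPulseStore implements PulseStore {
    private final Map<String, Long> data = new HashMap<>();

    @Override
    public void saveLastPulse(String updateId, long timestampMs) {
      data.put(updateId, timestampMs);
    }

    @Override
    public OptionalLong loadLastPulse(String updateId) {
      Long ts = data.get(updateId);
      return ts == null ? OptionalLong.empty() : OptionalLong.of(ts);
    }
  }

  private final PulseStore store;
  private final long pulseTimeoutMs;

  public DurablePulseSketch(PulseStore store, long pulseTimeoutMs) {
    this.store = store;
    this.pulseTimeoutMs = pulseTimeoutMs;
  }

  /** Records a pulse by writing its timestamp through to the store. */
  public void pulse(String updateId, long nowMs) {
    store.saveLastPulse(updateId, nowMs);
  }

  /** Blocked if no pulse was ever stored or the stored pulse has timed out. */
  public boolean isBlocked(String updateId, long nowMs) {
    OptionalLong last = store.loadLastPulse(updateId);
    return !last.isPresent() || nowMs - last.getAsLong() >= pulseTimeoutMs;
  }
}
```

A second DurablePulseSketch built on the same store plays the role of the 
scheduler after failover: with a 1h timeout the update stays unblocked. If the 
failover takes longer than the pulse timeout, the restored pulse has already 
expired and the update is blocked anyway. Every isBlocked call here reads from 
the store, so a real implementation would have to weigh that read cost.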



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
