[
https://issues.apache.org/jira/browse/FLINK-12260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830862#comment-16830862
]
Hwanju Kim commented on FLINK-12260:
------------------------------------
Thanks for the clarification. I thought you meant it without introducing any
additional map, but now it seems clear.
I had been thinking of a conservative approach since I couldn't 100% rule out the
possibility of a sender-side race. As we may have a potentially simpler solution, I
looked at the code again a little further. Initially, what led me to suspect a
possible race is this part:
{code:scala}
val a = PromiseActorRef(ref.provider, timeout, targetName = actorRef, message.getClass.getName, sender)
actorRef.tell(message, a)
a.result.future
{code}
This is internalAsk from invokeRpc, and PromiseActorRef internally calls
scheduler.scheduleOnce(timeout.duration) to set up the timer. The tell on actorRef
sends the message to the RM through RemoteActorRef and EndpointManager, where it is
passed to the Dispatcher, which enqueues it into the mailbox and executes the mailbox
via an executor thread. My impression was that, since tell is asynchronous via the
executor service, the timer of PromiseActorRef, which is set up beforehand, could
fire before the message even leaves the sender. Although that would be possible, the
message at least seems to be enqueued into the mailbox for the RM endpoint, so its
order is preserved with respect to the next attempt after the timeout. So, the
ordering seems fine. I was also concerned about the case where two different ask
calls might happen to use two different TCP connections, leading to possible
out-of-order delivery. Although I did not 100% exercise the relevant code, it appears
to use a single connection associated with the Akka endpoints, and I confirmed that
with a packet capture.
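To make the timing concrete, here is a minimal sketch (hypothetical Java, not actual
Flink or Akka code; AskRetrySketch, tell, ask, and askWithRetry are placeholders I
made up) of the ask-with-retry pattern discussed above, where each attempt arms its
timeout before the message is handed off and a retry simply issues a new ask with a
doubled timeout:
{code:java}
import java.util.concurrent.*;

// Minimal sketch (hypothetical, not Flink or Akka code) of the ask-with-retry
// pattern discussed above. The timeout timer for each attempt is armed before the
// message is handed to the transport, and a retry issues a new ask with a doubled
// timeout, as in the TM log (attempt 1: 100 ms, attempt 2: 200 ms).
public class AskRetrySketch {

    private static final ScheduledExecutorService TIMER =
            Executors.newSingleThreadScheduledExecutor();

    // Stand-in for actorRef.tell(message, tempActor): a fire-and-forget send. In the
    // real code the message is enqueued into the remote endpoint's mailbox; a reply
    // (if any) would complete the given future. No-op here for brevity.
    static void tell(String message, CompletableFuture<String> replyTo) {
    }

    // One ask attempt: arm the timeout first (like PromiseActorRef's scheduleOnce),
    // then send the message.
    static CompletableFuture<String> ask(String message, long timeoutMs) {
        CompletableFuture<String> result = new CompletableFuture<>();
        TIMER.schedule(() -> {
            result.completeExceptionally(new TimeoutException("ask timed out"));
        }, timeoutMs, TimeUnit.MILLISECONDS);
        tell(message, result);
        return result;
    }

    // Retry loop: on timeout, re-ask with a doubled timeout (unbounded here for
    // brevity).
    static CompletableFuture<String> askWithRetry(String message, long timeoutMs) {
        return ask(message, timeoutMs)
                .handle((reply, failure) -> failure == null
                        ? CompletableFuture.completedFuture(reply)
                        : askWithRetry(message, timeoutMs * 2))
                .thenCompose(f -> f);
    }
}
{code}
The point of the sketch is that a firing timer only abandons the sender-side future;
it cannot recall a request that is already enqueued in the receiver's mailbox, which
is why the ordering argument above matters.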
So, based on the code inspection and no successful repro on the sender side, we can
currently conclude that the race is likely happening in the task executor
connection/handshake on the receiver side (as the repro shows). I will test it out
with Till's proposal. On our side, once this fix ends up being applied, we can keep
an eye on our test apps, which intermittently hit this issue, to see if there is any
other race issue.
> Slot allocation failure by taskmanager registration timeout and race
> --------------------------------------------------------------------
>
> Key: FLINK-12260
> URL: https://issues.apache.org/jira/browse/FLINK-12260
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.6.3
> Reporter: Hwanju Kim
> Priority: Critical
> Attachments: FLINK-12260-repro.diff
>
>
>
> In 1.6.2, we have seen slot allocation failures keep happening for a long time.
> Having looked at the logs, I see the following behavior:
> # TM sends a registration request R1 to the resource manager.
> # R1 times out after 100ms, which is the initial timeout.
> # TM retries with a registration request R2 to the resource manager (with timeout
> 200ms).
> # R2 arrives first at the resource manager and is registered, and then TM gets a
> successful response, moving on to step 5 below.
> # On successful registration, R2's instance is put into taskManagerRegistrations.
> # Then R1 arrives at the resource manager, which realizes the same TM resource ID
> is already registered and thus unregisters R2's instance ID from
> taskManagerRegistrations. A new instance ID for R1 is registered to
> workerRegistration (see the sketch after this list).
> # R1's response is not handled though, since it already timed out (see the akka
> temp actor resolve failure below), hence no registration to
> taskManagerRegistrations.
> # TM keeps heartbeating to the resource manager with its slot status.
> # The resource manager ignores this slot status, since taskManagerRegistrations
> contains R2, not R1, which replaced R2 in workerRegistration at step 6.
> # The slot request can never be fulfilled, timing out.
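> The race in steps 5-7 can be sketched roughly as follows (a hypothetical,
> simplified illustration based only on the behavior above, not the actual Flink
> source; class and map names are placeholders):
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
> import java.util.UUID;
>
> // Hypothetical sketch of the resource-manager-side handling that produces steps
> // 5-7: a registration for an already-known ResourceID replaces the old
> // registration and unregisters it from the SlotManager-side map, even when the
> // new request belongs to an attempt whose sender has already timed out.
> class RegistrationRaceSketch {
>
>     static final class WorkerRegistration {
>         final UUID instanceId = UUID.randomUUID(); // fresh instance ID per registration
>     }
>
>     // keyed by ResourceID (e.g. 46c8e0d0fcf2c306f11954a1040d5677)
>     private final Map<String, WorkerRegistration> workerRegistrations = new HashMap<>();
>     // keyed by instance ID (e.g. deade132e2c41c52019cdc27977266cf)
>     private final Map<UUID, WorkerRegistration> taskManagerRegistrations = new HashMap<>();
>
>     // Called when a TaskExecutor registration request (R1 or R2) arrives at the RM.
>     WorkerRegistration registerTaskExecutor(String resourceId) {
>         WorkerRegistration old = workerRegistrations.remove(resourceId);
>         if (old != null) {
>             // Step 6: "Replacing old registration" also drops R2's instance ID,
>             // i.e. "Unregister TaskManager ... from the SlotManager".
>             taskManagerRegistrations.remove(old.instanceId);
>         }
>         WorkerRegistration fresh = new WorkerRegistration();
>         workerRegistrations.put(resourceId, fresh);
>         return fresh;
>     }
>
>     // Called as part of the successful registration flow (step 5 for R2; never
>     // completed for R1, whose response already timed out on the TM side - step 7).
>     void onRegistrationSuccess(WorkerRegistration registration) {
>         taskManagerRegistrations.put(registration.instanceId, registration);
>     }
> }
> {code}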
> The following are the debug logs for the above steps:
>
> {code:java}
> JM log:
> 2019-04-11 22:39:40.000,Registering TaskManager with ResourceID
> 46c8e0d0fcf2c306f11954a1040d5677
> (akka.ssl.tcp://flink@flink-taskmanager:6122/user/taskmanager_0) at
> ResourceManager
> 2019-04-11 22:39:40.000,Registering TaskManager
> 46c8e0d0fcf2c306f11954a1040d5677 under deade132e2c41c52019cdc27977266cf at
> the SlotManager.
> 2019-04-11 22:39:40.000,Replacing old registration of TaskExecutor
> 46c8e0d0fcf2c306f11954a1040d5677.
> 2019-04-11 22:39:40.000,Unregister TaskManager
> deade132e2c41c52019cdc27977266cf from the SlotManager.
> 2019-04-11 22:39:40.000,Registering TaskManager with ResourceID
> 46c8e0d0fcf2c306f11954a1040d5677
> (akka.ssl.tcp://flink@flink-taskmanager:6122/user/taskmanager_0) at
> ResourceManager
> TM log:
> 2019-04-11 22:39:40.000,Registration at ResourceManager attempt 1
> (timeout=100ms)
> 2019-04-11 22:39:40.000,Registration at ResourceManager
> (akka.ssl.tcp://flink@flink-jobmanager:6123/user/resourcemanager) attempt 1
> timed out after 100 ms
> 2019-04-11 22:39:40.000,Registration at ResourceManager attempt 2
> (timeout=200ms)
> 2019-04-11 22:39:40.000,Successful registration at resource manager
> akka.ssl.tcp://flink@flink-jobmanager:6123/user/resourcemanager under
> registration id deade132e2c41c52019cdc27977266cf.
> 2019-04-11 22:39:41.000,resolve of path sequence [/temp/$c] failed{code}
>
> As RPC calls seem to use akka ask, which creates a temporary source actor, I
> think multiple RPC calls could have arrived out of order via different actor
> pairs, and the symptom above seems to be due to that. If so, could the call carry
> an attempt count in its argument to prevent the unexpected unregistration? At
> this point, what I have done is only log analysis, so I could do further
> analysis, but before that I wanted to check whether it's a known issue. I also
> searched with some relevant terms and log pieces, but couldn't find a duplicate.
> Please mark this as a duplicate if one exists.
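> A rough sketch of that idea (purely hypothetical, not an existing Flink API) would
> be to carry the attempt number in the registration request so the resource manager
> can drop a stale attempt that arrives after a newer one:
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> // Hypothetical sketch: ignore a registration attempt that is older than the one
> // already recorded for the same ResourceID.
> class AttemptAwareRegistrationSketch {
>
>     static final class Registration {
>         final long attempt;
>         Registration(long attempt) { this.attempt = attempt; }
>     }
>
>     // keyed by ResourceID
>     private final Map<String, Registration> registrations = new HashMap<>();
>
>     // Returns true if the request was accepted, false if it was recognized as stale.
>     boolean register(String resourceId, long attempt) {
>         Registration current = registrations.get(resourceId);
>         if (current != null && current.attempt >= attempt) {
>             // R1 (attempt 1) arriving after R2 (attempt 2) would be ignored here
>             // instead of replacing and unregistering R2's instance.
>             return false;
>         }
>         registrations.put(resourceId, new Registration(attempt));
>         return true;
>     }
> }
> {code}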
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)