[ 
https://issues.apache.org/jira/browse/FLINK-14074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932426#comment-16932426
 ] 

Alexander Kasyanenko commented on FLINK-14074:
----------------------------------------------

Hi [~till.rohrmann], I'm sorry for the delay on my end. I was able to look into 
this issue only today.

Firstly, I've prepared example logs, but I need this log to be validated (still 
in progress), so I can share it outside of company.

But anyway, I've built flink with extended logging and found out some 
interesting things.
1. This check: 
[https://github.com/apache/flink/blob/release-1.9.0/flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosResourceManager.java#L436]
 has nothing to do with the problem, it isn't even called after the first job 
is cancelled/completed.

2. Last thing which happens during new slot allocation is this slot request 
submitted here: 
[https://github.com/apache/flink/blob/release-1.9.0/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L437],
 which I traced to method {{internalRequestSlot}} 
[https://github.com/apache/flink/blob/release-1.9.0/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/SlotManagerImpl.java#L750]
 but method {{allocateResource}} was never called: 
[https://github.com/apache/flink/blob/release-1.9.0/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/SlotManagerImpl.java#L795]

 

Right now I'm adding even more logging to see what is blocking slot allocation. 
I'll keep you posted, but as an update, I'm leaning towards some problems with 
pending slots.

> MesosResourceManager can't create new taskmanagers in Session Cluster Mode.
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-14074
>                 URL: https://issues.apache.org/jira/browse/FLINK-14074
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Mesos
>    Affects Versions: 1.9.0
>         Environment: Flink HA Session cluster 1.9.0 on mesos.
>            Reporter: Alexander Kasyanenko
>            Priority: Major
>
> Hi, I'm trying to launch multiple jobs in Flink Session Cluster, deployed on 
> mesos.
>  Flink's version is 1.9.0.
> The very first resource allocation completes successfully, and first 
> submitted job launches, but submitting any amount of jobs afterwords doesn't 
> affect the cluster in any way and no additional TaskManagers are allocated.
> From the logs I see that MesosResourceManager is requesting Slots for the 
> newly submitted jobs:  "{{o.a.f.m.r.c.MesosResourceManager - Request slot 
> with profile ResourceProfile..."}} but line {{"Starting a new worker.}}" 
> appears in log only the same amount of times as taskmanagers count, allocated 
> for the first job.
> I'm a complete noob in flink internals, but took a wild guess about a reason. 
> I think that the problem is in this check: 
> [https://github.com/apache/flink/blob/release-1.9.0/flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosResourceManager.java#L436]
> It might be that RM is lazily allocated at the first call by a factory, and 
> then a private final field {{slotsPerWorker}} is set. So this check will 
> prevent creation of any new worker after iterator traverses the entire 
> collection. My main assumption is that {{slotsPerWorker}} is never modified 
> again.
>  
> I'm sorry that I didn't do much of investigation before reporting, but I'll 
> try to do some after a weekend. I plan to build flink without this check and 
> see if it helps. Also I'll play around with tests for this RM. Since it's my 
> time running time flink internals, I'll be back after a few days.
> Any help will much appreciated.
> Thanks in advance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to