[
https://issues.apache.org/jira/browse/FLINK-14074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933284#comment-16933284
]
Alexander Kasyanenko commented on FLINK-14074:
----------------------------------------------
Yeah, looks suspiciously similar to FLINK-13241. I'll try to look into this
similarity.
I'll add a little bit of info about my test-case:
I deploy a simple word-count job, then wait for its completion, which results
in something like this:
[https://gist.github.com/Atlaster/c4ebc39e02006c2fe4daf6fedcb22481]
I'm logging method names and a little bit extra like this:
[https://gist.github.com/Atlaster/1a50229e62cc710ebcac2734b98e0f32]
[~trohrmann], can you please correct me if I'm wrong?
When a taskmanager is removed (I assume it was removed, based on the logs I
have), the internalUnregisterTaskManager method
[https://github.com/apache/flink/blob/release-1.9.0/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/SlotManagerImpl.java#L1146]
is called. It is then supposed to remove all slots and cancel all pending slots
here
[https://github.com/apache/flink/blob/release-1.9.0/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/SlotManagerImpl.java#L937].
But when a new job is started, there are still 4 pending slots (1 taskmanager
has 4 slots in my configuration).
Isn't Flink supposed to remove all pending slots when a taskmanager is removed?
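To make the question concrete, here is a toy model of why leftover pending
slots would matter. It is only a sketch and not Flink code: the class
PendingSlotSketch and all of its members are hypothetical, free registered
slots are not modelled at all, and it assumes that a slot request is parked on
an unassigned pending slot before a new worker is started.

```java
/** Toy model, not Flink code: every name here is hypothetical. */
class PendingSlotSketch {
    private final int slotsPerWorker;
    // Slots promised by workers that have not registered yet and are not
    // yet assigned to any slot request.
    private int unassignedPendingSlots;
    private int workersStarted;

    PendingSlotSketch(int slotsPerWorker, int stalePendingSlots) {
        this.slotsPerWorker = slotsPerWorker;
        this.unassignedPendingSlots = stalePendingSlots;
    }

    /** Serve one slot request; start a worker only if no pending slot can absorb it. */
    void requestSlot() {
        if (unassignedPendingSlots == 0) {
            workersStarted++;
            unassignedPendingSlots += slotsPerWorker;
        }
        // The request is parked on one pending slot.
        unassignedPendingSlots--;
    }

    int workersStarted() {
        return workersStarted;
    }

    public static void main(String[] args) {
        // Assumption: 4 slots per worker, as in my configuration.
        PendingSlotSketch clean = new PendingSlotSketch(4, 0);
        PendingSlotSketch stale = new PendingSlotSketch(4, 4);
        for (int i = 0; i < 4; i++) {
            clean.requestSlot();
            stale.requestSlot();
        }
        System.out.println("workers started without stale pending slots: " + clean.workersStarted());
        System.out.println("workers started with 4 stale pending slots: " + stale.workersStarted());
    }
}
```

If the real SlotManager behaves like this model, four pending slots that
survive the removal of a taskmanager would absorb the next four slot requests,
and no new worker would be requested for them.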
> MesosResourceManager can't create new taskmanagers in Session Cluster Mode.
> ---------------------------------------------------------------------------
>
> Key: FLINK-14074
> URL: https://issues.apache.org/jira/browse/FLINK-14074
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Mesos
> Affects Versions: 1.9.0
> Environment: Flink HA Session cluster 1.9.0 on mesos.
> Reporter: Alexander Kasyanenko
> Priority: Major
>
> Hi, I'm trying to launch multiple jobs in a Flink Session Cluster deployed on
> Mesos.
> Flink's version is 1.9.0.
> The very first resource allocation completes successfully and the first
> submitted job launches, but submitting any number of jobs afterwards doesn't
> affect the cluster in any way, and no additional TaskManagers are allocated.
> From the logs I see that MesosResourceManager is requesting slots for the
> newly submitted jobs: {{"o.a.f.m.r.c.MesosResourceManager - Request slot
> with profile ResourceProfile..."}}, but the line {{"Starting a new worker."}}
> appears in the log only as many times as the number of TaskManagers allocated
> for the first job.
> I'm a complete noob in Flink internals, but I took a wild guess at the reason.
> I think that the problem is in this check:
> [https://github.com/apache/flink/blob/release-1.9.0/flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosResourceManager.java#L436]
> It might be that RM is lazily allocated at the first call by a factory, and
> then a private final field {{slotsPerWorker}} is set. So this check will
> prevent creation of any new worker after iterator traverses the entire
> collection. My main assumption is that {{slotsPerWorker}} is never modified
> again.
>
> I'm sorry that I didn't do much investigation before reporting, but I'll
> try to do some after the weekend. I plan to build Flink without this check and
> see if it helps. I'll also play around with the tests for this RM. Since it's my
> first time working with Flink internals, I'll be back after a few days.
> Any help will be much appreciated.
> Thanks in advance.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)