[
https://issues.apache.org/jira/browse/FLINK-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Till Rohrmann closed FLINK-12245.
---------------------------------
Resolution: Duplicate
Thanks for reporting this issue [~hwanju]. I think the problem should be fixed
with Flink 1.6.3. See FLINK-9635 for more details. I'll close this issue as a
duplicate. If the problem still occurs, then please reopen this issue.
> Transient slot allocation failure on job recovery
> -------------------------------------------------
>
> Key: FLINK-12245
> URL: https://issues.apache.org/jira/browse/FLINK-12245
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.6.3
> Environment: Flink 1.6.2 with Kubernetes
> Reporter: Hwanju Kim
> Priority: Major
>
> In 1.6.2, we have experienced transient slot allocation failures on job
> recovery, especially when a task manager (TM) becomes unavailable and causes
> a heartbeat failure. By transient, we mean that recovery fails once with a
> slot allocation timeout (5 minutes by default) and the next recovery attempt
> succeeds.
>
> I found that each _Execution_ remembers its previous allocations and, for
> the sake of local state recovery, prefers the last previous allocation among
> the resolved slot candidates. If the previous allocation belongs to an
> unavailable TM, the candidates do not contain it, forcing the slot provider
> to request a new slot from the resource manager, which then finds a new TM
> and its available slots. So far this is expected and fine, but any
> subsequent execution that also belonged to the unavailable TM, and has the
> first task as its predecessor, fails to find its previous allocation as
> well. It also requests another new slot, since the gone previous allocation
> is never among the candidates. However, this behavior can generate more slot
> requests than there are available slots. For example, if two pipelined tasks
> shared one slot in one TM, and that TM crashes and is replaced with a new
> one, the two tasks generate two new slot requests. Since two slot requests
> cannot be fulfilled by a TM with a single slot, the job hits the slot
> allocation timeout and restarts.
>
> {code:java}
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate all requires slots within timeout of 300000 ms. Slots
> required: 2, slots allocated: 1
> {code}
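>
> To make the over-request concrete, here is a minimal, self-contained sketch
> of the scenario above. The class and variable names are hypothetical, not
> the real Flink types; it only models the matching behavior described:
>
> {code:java}
> import java.util.*;
>
> // Simplified model (hypothetical names, not actual Flink classes): each
> // execution prefers its previous AllocationID; if that ID is not among the
> // candidate slots, it asks the resource manager for a brand-new slot.
> public class SlotRequestSketch {
>     public static void main(String[] args) {
>         // Both tasks previously shared one slot on the TM that crashed.
>         String previousAllocation = "allocation-on-crashed-tm";
>         List<String> tasks = Arrays.asList("task-1", "task-2");
>
>         // The replacement TM offers exactly one free slot.
>         Set<String> candidateSlots = new HashSet<>(
>             Collections.singleton("allocation-on-new-tm"));
>
>         int newSlotRequests = 0;
>         for (String task : tasks) {
>             if (candidateSlots.remove(previousAllocation)) {
>                 System.out.println(task + ": reuses previous slot");
>             } else {
>                 // Previous allocation is gone -> request a new slot.
>                 newSlotRequests++;
>                 System.out.println(task + ": requests a new slot");
>             }
>         }
>         // Two requests against a single-slot TM -> allocation timeout.
>         System.out.println("new slot requests: " + newSlotRequests
>             + ", slots available: " + candidateSlots.size());
>     }
> }
> {code}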
>
> At the next round of recovery, since the second execution failed to allocate
> a new slot, its last previous allocation is _null_, so it falls back to the
> locality-based allocation strategy, which can find the slot allocated for
> the first task, and thus succeeds. Although recovery eventually succeeds,
> the slot allocation timeout adds to the downtime.
>
> The reason for this behavior is that
> _PreviousAllocationSchedulingStrategy.findMatchWithLocality()_ immediately
> returns _null_ if the previous allocation is non-empty and is not contained
> in the candidate list. If the previous allocation is not among the
> candidates, it could instead fall back to
> _LocationPreferenceSchedulingStrategy.findMatchWithLocality()_ rather than
> returning _null_. By doing so, it can avoid requesting more slots than are
> available. Requesting extra slots can be fine in an environment where the
> resource manager can reactively spawn more TMs (like YARN/Mesos), although
> it may spawn more than needed, but a StandaloneResourceManager with
> statically provisioned resources cannot help but fail to allocate the
> requested slots.
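>
> A minimal sketch of the fallback I have in mind, with simplified signatures
> (the real 1.6.x code operates on slot and locality abstractions, not
> strings, so this is only an illustration of the control flow):
>
> {code:java}
> import java.util.*;
>
> // Sketch of the proposed change (hypothetical, simplified types): if the
> // previous allocation is known but absent from the candidates, fall back
> // to the locality-based match instead of returning null, which would
> // otherwise trigger a new slot request to the resource manager.
> public class FallbackMatchSketch {
>
>     static String findMatchWithPreviousAllocation(
>             String previousAllocation, List<String> candidates) {
>         if (previousAllocation != null) {
>             if (candidates.contains(previousAllocation)) {
>                 return previousAllocation; // exact previous slot available
>             }
>             // 1.6.x returns null here; the proposal is to fall through
>             // to the locality-based strategy instead.
>         }
>         return findMatchWithLocality(candidates);
>     }
>
>     // Stand-in for the locality-based strategy: here we simply take any
>     // available candidate.
>     static String findMatchWithLocality(List<String> candidates) {
>         return candidates.isEmpty() ? null : candidates.get(0);
>     }
>
>     public static void main(String[] args) {
>         List<String> candidates = Arrays.asList("slot-on-new-tm");
>         // Previous allocation lived on the crashed TM; not a candidate.
>         System.out.println(findMatchWithPreviousAllocation(
>             "allocation-on-crashed-tm", candidates));
>     }
> }
> {code}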
>
> Having looked at the mainline branch and 1.8.0, although I have not
> attempted to reproduce this issue there, the related code has been changed
> to what I expected (falling back to the locality-based strategy if the
> previous allocation is not among the candidates):
> _PreviousAllocationSlotSelectionStrategy.selectBestSlotForProfile()_. That
> led me to read up on the group-aware scheduling work
> ([https://docs.google.com/document/d/1q7NOqt05HIN-PlKEEPB36JiuU1Iu9fnxxVGJzylhsxU/edit#heading=h.k15nfgsa5bnk]).
> In addition, I checked that in 1.6.2 the _matchPreviousLocationNotAvailable_
> test expects the problematic behavior I described, so I started wondering
> whether the behavior of the previous-allocation strategy outside mainline is
> by design. I have a fix similar to the mainline and have verified that it
> resolves the problem, but I am raising this issue to get context around the
> behavior and to discuss possible side effects of the fix. I understand that
> the current vertex-by-vertex scheduling could become inefficient by letting
> an execution that belonged to the unavailable slot steal another task's
> previous slot, but a slot allocation failure seems worse to me.
> A sketch of how I read the mainline structure follows below.
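>
> For context, my reading is that the mainline restructures this as one
> selection strategy wrapping another, so the fallback is built into the
> composition rather than inlined. Roughly, as a hypothetical simplification
> (not the actual Flink interfaces or signatures):
>
> {code:java}
> import java.util.*;
>
> // Hypothetical simplification: the previous-allocation strategy decorates
> // a location-preference strategy, so a missing previous allocation
> // degrades gracefully instead of forcing a new slot request.
> public class StrategyCompositionSketch {
>
>     interface SlotSelectionStrategy {
>         Optional<String> selectBestSlotForProfile(
>             List<String> availableSlots, String previousAllocation);
>     }
>
>     static class LocationPreferenceStrategy implements SlotSelectionStrategy {
>         @Override
>         public Optional<String> selectBestSlotForProfile(
>                 List<String> availableSlots, String previousAllocation) {
>             // Simplified: pick any available slot.
>             return availableSlots.stream().findFirst();
>         }
>     }
>
>     static class PreviousAllocationStrategy implements SlotSelectionStrategy {
>         private final SlotSelectionStrategy fallback;
>
>         PreviousAllocationStrategy(SlotSelectionStrategy fallback) {
>             this.fallback = fallback;
>         }
>
>         @Override
>         public Optional<String> selectBestSlotForProfile(
>                 List<String> availableSlots, String previousAllocation) {
>             if (previousAllocation != null
>                     && availableSlots.contains(previousAllocation)) {
>                 return Optional.of(previousAllocation);
>             }
>             // Previous slot is gone: delegate instead of returning empty.
>             return fallback.selectBestSlotForProfile(
>                 availableSlots, previousAllocation);
>         }
>     }
>
>     public static void main(String[] args) {
>         SlotSelectionStrategy strategy =
>             new PreviousAllocationStrategy(new LocationPreferenceStrategy());
>         System.out.println(strategy.selectBestSlotForProfile(
>             Arrays.asList("slot-on-new-tm"), "allocation-on-crashed-tm"));
>     }
> }
> {code}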
>
> I searched existing issues for slot allocation failures and could not find
> the same issue, hence this report. Please feel free to mark it as a
> duplicate if one exists.