[
https://issues.apache.org/jira/browse/FLINK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559951#comment-14559951
]
ASF GitHub Bot commented on FLINK-1952:
---------------------------------------
GitHub user StephanEwen opened a pull request:
https://github.com/apache/flink/pull/731
[FLINK-1952] [jobmanager] Rework and fix slot sharing scheduler
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/StephanEwen/incubator-flink slots_fix
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/731.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #731
----
commit 771437360662bcf105c72d95924a5ce3c69f1585
Author: Stephan Ewen <[email protected]>
Date: 2015-05-19T17:08:25Z
Add big not so mini cluster test for CC to provoke scheduler problem
commit 067c3868c07ea125d8f429e38476d3e8edfbad08
Author: Stephan Ewen <[email protected]>
Date: 2015-05-20T09:37:56Z
[FLINK-1952] [jobmanager] Rework and fix slot sharing scheduler
commit 88074eda5c99945d0c0f106240010a451ba41658
Author: Stephan Ewen <[email protected]>
Date: 2015-05-26T20:56:50Z
[tests] Fix AvroExternalJarProgramITCase logging
----
> Cannot run ConnectedComponents example: Could not allocate a slot on instance
> -----------------------------------------------------------------------------
>
> Key: FLINK-1952
> URL: https://issues.apache.org/jira/browse/FLINK-1952
> Project: Flink
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 0.9
> Reporter: Robert Metzger
> Priority: Blocker
>
> Steps to reproduce
> {code}
> ./bin/yarn-session.sh -n 350
> {code}
> ... wait until they are connected ...
> {code}
> Number of connected TaskManagers changed to 266. Slots available: 266
> Number of connected TaskManagers changed to 323. Slots available: 323
> Number of connected TaskManagers changed to 334. Slots available: 334
> Number of connected TaskManagers changed to 343. Slots available: 343
> Number of connected TaskManagers changed to 350. Slots available: 350
> {code}
> Start CC
> {code}
> ./bin/flink run -p 350 ./examples/flink-java-examples-0.9-SNAPSHOT-ConnectedComponents.jar
> {code}
> ---> it runs
> Run KMeans, let it fail with
> {code}
> Failed to deploy the task Map (Map at main(KMeans.java:100)) (1/350) - execution #0 to slot SimpleSlot (2)(2)(0) - 182b7661ca9547a84591de940c47a200 - ALLOCATED/ALIVE: java.io.IOException: Insufficient number of network buffers: required 350, but only 254 available. The total number of network buffers is currently set to 2048. You can increase this number by setting the configuration key 'taskmanager.network.numberOfBuffers'.
> {code}
> ... as expected.
> (I've waited for 10 minutes between the two submissions)
> Starting CC now will fail:
> {code}
> ./bin/flink run -p 350 ./examples/flink-java-examples-0.9-SNAPSHOT-ConnectedComponents.jar
> {code}
> Error message(s):
> {code}
> Caused by: java.lang.IllegalStateException: Could not schedule consumer vertex IterationHead(WorksetIteration (Unnamed Delta Iteration)) (19/350)
> at org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:479)
> at org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:469)
> at akka.dispatch.Futures$$anonfun$future$1.apply(Future.scala:94)
> at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
> ... 4 more
> Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate a slot on instance 4a6d761cb084c32310ece1f849556faf @ cloud-19 - 1 slots - URL: akka.tcp://[email protected]:51400/user/taskmanager, as required by the co-location constraint.
> at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:247)
> at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:110)
> at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:262)
> at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:436)
> at org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:475)
> ... 9 more
> {code}