[ https://issues.apache.org/jira/browse/SPARK-32411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163742#comment-17163742 ]
L. C. Hsieh commented on SPARK-32411:
-------------------------------------

I think it is because of the configs: "spark.task.resource.gpu.amount 2" means each task requires 2 GPUs, but "spark.executor.resource.gpu.amount 1" specifies that each executor has only 1 GPU, so the task scheduler cannot find an executor that meets the task requirement.
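For reference, a spark-defaults.conf consistent with that explanation might look like the sketch below. It assumes each worker machine actually exposes at least two GPUs, and that the discovery script lives at an absolute path (the leading "./" in the reported path looks like a typo):

{code:java}
# Each executor claims 2 GPUs on its worker...
spark.executor.resource.gpu.amount          2
# ...and each task asks for 1, so up to two tasks can run per
# executor and the scheduler can always satisfy a task request.
spark.task.resource.gpu.amount              1
# Script that reports the GPU addresses available on a worker
# (absolute path assumed; adjust to your installation).
spark.executor.resource.gpu.discoveryScript /usr/local/spark/getGpusResources.sh
{code}

Equivalently, keeping spark.executor.resource.gpu.amount at 1 and lowering spark.task.resource.gpu.amount to 1 (or to a fraction such as 0.25, to let tasks share a GPU) would also let the scheduler place tasks.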
> GPU Cluster Fail
> ----------------
>
>                 Key: SPARK-32411
>                 URL: https://issues.apache.org/jira/browse/SPARK-32411
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Web UI
>    Affects Versions: 3.0.0
>         Environment: I have an Apache Spark 3.0 cluster consisting of machines
> with multiple NVIDIA GPUs, and I connect my Jupyter notebook to the cluster
> using pyspark.
>            Reporter: Vinh Tran
>            Priority: Major
>
> I'm having a difficult time getting a GPU cluster started on Apache Spark
> 3.0. It was hard to find documentation on this, but I stumbled on an NVIDIA
> GitHub page for RAPIDS which suggested the following additional edits to the
> spark-defaults.conf:
> {code:java}
> spark.task.resource.gpu.amount 0.25
> spark.executor.resource.gpu.discoveryScript ./usr/local/spark/getGpusResources.sh
> {code}
> I have an Apache Spark 3.0 cluster consisting of machines with multiple
> NVIDIA GPUs, and I connect my Jupyter notebook to the cluster using pyspark;
> however, this results in the following error:
> {code:java}
> Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
> : org.apache.spark.SparkException: You must specify an amount for gpu
> 	at org.apache.spark.resource.ResourceUtils$.$anonfun$parseResourceRequest$1(ResourceUtils.scala:142)
> 	at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:119)
> 	at org.apache.spark.resource.ResourceUtils$.parseResourceRequest(ResourceUtils.scala:142)
> 	at org.apache.spark.resource.ResourceUtils$.$anonfun$parseAllResourceRequests$1(ResourceUtils.scala:159)
> 	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
> 	at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:75)
> 	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
> 	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
> 	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
> 	at org.apache.spark.resource.ResourceUtils$.parseAllResourceRequests(ResourceUtils.scala:159)
> 	at org.apache.spark.SparkContext$.checkResourcesPerTask$1(SparkContext.scala:2773)
> 	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2884)
> 	at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
> 	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> 	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
> 	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> 	at py4j.Gateway.invoke(Gateway.java:238)
> 	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> 	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> 	at py4j.GatewayConnection.run(GatewayConnection.java:238)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
> After this, I tried adding another line to the conf per the instructions,
> which results in no errors; however, when I log in to the Web UI at
> localhost:8080, under Running Applications, the state remains at WAITING.
> {code:java}
> spark.task.resource.gpu.amount 2
> spark.executor.resource.gpu.discoveryScript ./usr/local/spark/getGpusResources.sh
> spark.executor.resource.gpu.amount 1
> {code}
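Once an application leaves the WAITING state, a minimal PySpark check along these lines (a sketch; the app name and partition count are illustrative, not from the report) can confirm that tasks are actually being assigned GPU addresses:

{code:python}
from pyspark import SparkContext, TaskContext

# Assumes the corrected GPU settings above are in spark-defaults.conf;
# the app name is arbitrary.
sc = SparkContext(appName="gpu-smoke-test")

def assigned_gpus(_):
    # TaskContext.resources() maps resource name -> ResourceInformation;
    # "gpu" is present only when the task's GPU request was granted.
    yield TaskContext.get().resources()["gpu"].addresses

# Two partitions -> two tasks; each should report the GPU address(es)
# the scheduler granted it.
print(sc.parallelize(range(2), 2).mapPartitions(assigned_gpus).collect())
{code}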