Nishkam Ravi created SPARK-2398:
-----------------------------------
Summary: Trouble running Spark 1.0 on Yarn
Key: SPARK-2398
URL: https://issues.apache.org/jira/browse/SPARK-2398
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.0.0
Reporter: Nishkam Ravi
Trouble running workloads in Spark-on-YARN cluster mode with Spark 1.0.
For example, SparkPageRank runs without errors in standalone mode (tested with
up to a 30 GB input dataset on a 6-node cluster), and it also completes in
yarn-cluster mode with a 1 GB dataset. It starts to fail in yarn-cluster mode
as the input size grows; confirmed with a 16 GB input dataset.
The same workload runs fine with Spark 0.9 in both standalone and yarn-cluster
mode (up to a 30 GB input dataset on the same 6-node cluster).
Command line used:

/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn \
  --deploy-mode cluster --properties-file pagerank.conf \
  --driver-memory 30g --driver-cores 16 --num-executors 5 \
  --class org.apache.spark.examples.SparkPageRank \
  /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar \
  pagerank_in $NUM_ITER
pagerank.conf:
spark.master spark://c1704.halxg.cloudera.com:7077
spark.home /opt/cloudera/parcels/CDH/lib/spark
spark.executor.memory 32g
spark.default.parallelism 118
spark.cores.max 96
spark.storage.memoryFraction 0.6
spark.shuffle.memoryFraction 0.3
spark.shuffle.compress true
spark.shuffle.spill.compress true
spark.broadcast.compress true
spark.rdd.compress false
spark.io.compression.codec org.apache.spark.io.LZFCompressionCodec
spark.io.compression.snappy.block.size 32768
spark.reducer.maxMbInFlight 48
spark.local.dir /var/lib/jenkins/workspace/tmp
spark.driver.memory 30g
spark.executor.cores 16
spark.locality.wait 6000
spark.executor.instances 5
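
For reference, a minimal sketch of the same run with the pagerank.conf properties set programmatically on a SparkConf (the object name PageRankRepro is made up; the PageRank body mirrors the bundled SparkPageRank example, and the property keys/values are copied from the configuration above). Not a fix, just an easier-to-share reproduction:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// Sketch: mirrors pagerank.conf so the run can be reproduced without a
// properties file. Keys/values are copied from the configuration above;
// driver/executor sizing is still expected to come from spark-submit.
object PageRankRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SparkPageRank repro")
      .set("spark.default.parallelism", "118")
      .set("spark.storage.memoryFraction", "0.6")
      .set("spark.shuffle.memoryFraction", "0.3")
      .set("spark.shuffle.compress", "true")
      .set("spark.shuffle.spill.compress", "true")
      .set("spark.broadcast.compress", "true")
      .set("spark.rdd.compress", "false")
      .set("spark.io.compression.codec", "org.apache.spark.io.LZFCompressionCodec")
      .set("spark.reducer.maxMbInFlight", "48")
      .set("spark.locality.wait", "6000")
    val sc = new SparkContext(conf)

    // Same structure as the SparkPageRank example: build the link graph,
    // then iterate rank contributions.
    val iters = args(1).toInt
    val links = sc.textFile(args(0)).map { line =>
      val parts = line.split("\\s+")
      (parts(0), parts(1))
    }.distinct().groupByKey().cache()
    var ranks = links.mapValues(_ => 1.0)
    for (_ <- 1 to iters) {
      val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
        val size = urls.size
        urls.map(url => (url, rank / size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }
    ranks.collect().foreach { case (url, rank) => println(url + " has rank: " + rank) }
    sc.stop()
  }
}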
The UI shows ExecutorLostFailure, and the YARN logs contain numerous exceptions:
14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection to ConnectionManagerId(a1016.halxg.cloudera.com,54105)
java.nio.channels.AsynchronousCloseException
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
        at org.apache.spark.network.SendingConnection.write(Connection.scala:361)
        at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
--------
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
        at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
        at java.io.FilterInputStream.close(FilterInputStream.java:181)
        at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
        at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
        at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
        at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
        at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
        at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
        at org.apache.spark.scheduler.Task.run(Task.scala:51)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
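
For context on the exception above: Hadoop caches FileSystem instances per (scheme, authority, user), so once the shared instance backing a task's input stream has been closed, the task-completion callback that closes the LineRecordReader hits DFSClient.checkOpen() and gets "Filesystem closed". A minimal, self-contained sketch of that failure mode (namenode URI and path are placeholders; this only illustrates the exception, not where Spark triggers it):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object FilesystemClosedRepro {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Cached, shared client for this (scheme, authority, user).
    val fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf)
    val in = fs.open(new Path("/user/placeholder/pagerank_in"))

    // Some other code path (e.g. a shutdown hook, or another thread sharing
    // the same cached instance) closes the FileSystem; the DFSClient
    // underneath is now closed.
    fs.close()

    // Cleanup code that later closes its record reader / input stream fails:
    in.close() // java.io.IOException: Filesystem closed
  }
}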
-------
14/07/07 17:59:52 WARN network.SendingConnection: Error finishing connection to a1016.halxg.cloudera.com/10.20.184.116:54105
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:313)
        at org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)