[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058552#comment-14058552 ]

Sean Owen commented on SPARK-2398:
----------------------------------

[~gq] This does not really have to do with allocating memory off heap per 
se. Your second reply is closer.
[~nravi] If you give a Java process a 16GB heap, and tell YARN that its 
container may use 16GB of memory, then the container will get killed at 
some point, since the JVM's physical memory footprint will certainly grow 
beyond 16GB. This is just how Java and YARN work. 
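
To make the arithmetic concrete (the numbers below are illustrative, not 
measured from this cluster): the heap is only part of the JVM's resident 
set, so a 16GB heap plus the usual native allocations already overshoots 
a 16GB container:

    heap (-Xmx)                                    16.0 GB
    + thread stacks, permgen, code cache,
      direct/NIO buffers, other native memory      ~1-2 GB
    -------------------------------------------------------
    = physical footprint                           > 16 GB  -> kill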

I suspect your cluster config is actually different. There are several 
YARN configurations that matter here, chiefly the maximum memory that a 
container can request. Yes, spark.yarn.executor.memoryOverhead could be 
increased to give more headroom, but I don't yet know whether this is the 
issue.
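
For reference, these are the knobs involved. The values below are only a 
sketch of a configuration where the request fits; your actual limits may 
differ, and yarn.scheduler.maximum-allocation-mb really lives in 
yarn-site.xml (shown here as plain key/value for brevity):

    # Spark side: executor heap, plus off-heap headroom in MB
    spark.executor.memory                   16g
    spark.yarn.executor.memoryOverhead      2048

    # YARN side: per-container ceiling; the total request
    # (16384MB + 2048MB = 18432MB) must fit under it
    yarn.scheduler.maximum-allocation-mb    20480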

How big is the YARN container vs your heap size?
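
(If the container is in fact exceeding its limit, the NodeManager log for 
the lost executor should say so explicitly; the message looks roughly like 
the following, with the real pid/containerID and your actual numbers:)

    Container [pid=..., containerID=container_...] is running beyond 
    physical memory limits. Current usage: 16.3 GB of 16 GB physical 
    memory used. Killing container.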

> Trouble running Spark 1.0 on Yarn 
> ----------------------------------
>
>                 Key: SPARK-2398
>                 URL: https://issues.apache.org/jira/browse/SPARK-2398
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.0
>            Reporter: Nishkam Ravi
>
> Trouble running workloads in Spark-on-YARN cluster mode with Spark 1.0. 
> For example: SparkPageRank, when run in standalone mode, goes through 
> without any errors (tested for up to a 30GB input dataset on a 6-node 
> cluster). It also runs fine for a 1GB dataset in yarn cluster mode, but 
> starts to choke as the input data size is increased; confirmed for a 
> 16GB input dataset.
> The same workload runs fine with Spark 0.9 in both standalone and yarn 
> cluster mode (for up to a 30GB input dataset on a 6-node cluster).
> Command line used:
> (/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn 
> --deploy-mode cluster --properties-file pagerank.conf  --driver-memory 30g 
> --driver-cores 16 --num-executors 5 --class 
> org.apache.spark.examples.SparkPageRank 
> /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar
>  pagerank_in $NUM_ITER)
> pagerank.conf:
> spark.master            spark://c1704.halxg.cloudera.com:7077
> spark.home      /opt/cloudera/parcels/CDH/lib/spark
> spark.executor.memory   32g
> spark.default.parallelism       118
> spark.cores.max 96
> spark.storage.memoryFraction    0.6
> spark.shuffle.memoryFraction    0.3
> spark.shuffle.compress  true
> spark.shuffle.spill.compress    true
> spark.broadcast.compress        true
> spark.rdd.compress      false
> spark.io.compression.codec      org.apache.spark.io.LZFCompressionCodec
> spark.io.compression.snappy.block.size  32768
> spark.reducer.maxMbInFlight     48
> spark.local.dir  /var/lib/jenkins/workspace/tmp
> spark.driver.memory     30g
> spark.executor.cores    16
> spark.locality.wait     6000
> spark.executor.instances        5
> The UI shows ExecutorLostFailure. The YARN logs contain numerous 
> exceptions:
> 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection to ConnectionManagerId(a1016.halxg.cloudera.com,54105)
> java.nio.channels.AsynchronousCloseException
>         at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
>         at org.apache.spark.network.SendingConnection.write(Connection.scala:361)
>         at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> --------
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
>         at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
>         at java.io.FilterInputStream.close(FilterInputStream.java:181)
>         at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
>         at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
>         at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
>         at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
>         at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
>         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> -------
> 14/07/07 17:59:52 WARN network.SendingConnection: Error finishing connection to a1016.halxg.cloudera.com/10.20.184.116:54105
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>         at org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:313)
>         at org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
