[
https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058298#comment-14058298
]
Guoqiang Li edited comment on SPARK-2398 at 7/11/14 3:16 AM:
-------------------------------------------------------------
Q1. {{-Xmx}} limits only the Java heap; native libraries and
{{sun.misc.Unsafe}} can easily allocate memory outside the heap.
Reference:
http://stackoverflow.com/questions/6527131/java-using-more-memory-than-the-allocated-memory
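For illustration, a minimal Scala sketch of that point (the object name and the 1 GiB size are mine, not from this issue): the allocation below grows the process's resident memory without touching the {{-Xmx}}-bounded heap, which is exactly how a container can exceed YARN's physical-memory limit while the heap stays within bounds.
{code:scala}
import sun.misc.Unsafe

// Minimal sketch: allocate off-heap memory that -Xmx does not bound.
object OffHeapDemo {
  def main(args: Array[String]): Unit = {
    // The Unsafe singleton is private; grab it via reflection.
    val field = classOf[Unsafe].getDeclaredField("theUnsafe")
    field.setAccessible(true)
    val unsafe = field.get(null).asInstanceOf[Unsafe]

    // 1 GiB outside the Java heap (size chosen for illustration):
    // process RSS grows, heap usage reported by Runtime stays flat.
    val bytes = 1L << 30
    val address = unsafe.allocateMemory(bytes)
    unsafe.setMemory(address, bytes, 0.toByte) // touch pages so the OS commits them

    val rt = Runtime.getRuntime
    println(s"heap used: ${(rt.totalMemory - rt.freeMemory) >> 20} MiB")

    unsafe.freeMemory(address)
  }
}
{code}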
Q2. This is not a bug. We can disable this check by setting
{{yarn.nodemanager.pmem-check-enabled}} to {{false}}; its default value is
{{true}} in
[yarn-default.xml|http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml].
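Concretely, that means adding something like the following to {{yarn-site.xml}} on each NodeManager and restarting the NodeManagers (a sketch; the exact placement depends on how your cluster manages YARN configuration):
{code:xml}
<!-- yarn-site.xml: turn off the NodeManager's physical-memory check -->
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
{code}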
> Trouble running Spark 1.0 on Yarn
> ----------------------------------
>
> Key: SPARK-2398
> URL: https://issues.apache.org/jira/browse/SPARK-2398
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.0
> Reporter: Nishkam Ravi
>
> Trouble running workloads in Spark-on-YARN cluster mode with Spark 1.0.
> For example: SparkPageRank runs without any errors in standalone mode
> (tested for up to a 30GB input dataset on a 6-node cluster), and also runs
> fine for a 1GB dataset in yarn cluster mode, but starts to fail (in yarn
> cluster mode) as the input data size is increased. Confirmed for a 16GB
> input dataset.
> The same workload runs fine with Spark 0.9 in both standalone and yarn
> cluster mode (for up to a 30GB input dataset on a 6-node cluster).
> Commandline used:
> /opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn \
>   --deploy-mode cluster --properties-file pagerank.conf \
>   --driver-memory 30g --driver-cores 16 --num-executors 5 \
>   --class org.apache.spark.examples.SparkPageRank \
>   /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar \
>   pagerank_in $NUM_ITER
> pagerank.conf:
> spark.master spark://c1704.halxg.cloudera.com:7077
> spark.home /opt/cloudera/parcels/CDH/lib/spark
> spark.executor.memory 32g
> spark.default.parallelism 118
> spark.cores.max 96
> spark.storage.memoryFraction 0.6
> spark.shuffle.memoryFraction 0.3
> spark.shuffle.compress true
> spark.shuffle.spill.compress true
> spark.broadcast.compress true
> spark.rdd.compress false
> spark.io.compression.codec org.apache.spark.io.LZFCompressionCodec
> spark.io.compression.snappy.block.size 32768
> spark.reducer.maxMbInFlight 48
> spark.local.dir /var/lib/jenkins/workspace/tmp
> spark.driver.memory 30g
> spark.executor.cores 16
> spark.locality.wait 6000
> spark.executor.instances 5
> UI shows ExecutorLostFailure. Yarn logs contain numerous exceptions:
> 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection to ConnectionManagerId(a1016.halxg.cloudera.com,54105)
> java.nio.channels.AsynchronousCloseException
>         at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
>         at org.apache.spark.network.SendingConnection.write(Connection.scala:361)
>         at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> --------
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
>         at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
>         at java.io.FilterInputStream.close(FilterInputStream.java:181)
>         at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
>         at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
>         at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
>         at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
>         at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
>         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> -------
> 14/07/07 17:59:52 WARN network.SendingConnection: Error finishing connection to a1016.halxg.cloudera.com/10.20.184.116:54105
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>         at org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:313)
>         at org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)