[
https://issues.apache.org/jira/browse/SYSTEMML-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534269#comment-15534269
]
Matthias Boehm edited comment on SYSTEMML-994 at 9/29/16 10:41 PM:
-------------------------------------------------------------------
A couple of comments:
(1) Please do not specify the default parallelism - using the number of input
partitions, or the total number of cores for parallelize (both are the
defaults), usually works very well.
(2) In my experiments, the simple configuration of 1 executor per node with
large memory actually works best (it also has advantages in terms of resource
consumption because, for example, broadcasts can be shared in memory).
(3) Over-committing CPU is usually a good idea too, so there is no need to
spare a core for the OS.
(4) Since you are running standalone, you might want to explicitly specify
'spark.local.dir'.
(5) Make sure the driver runs on a single node with nothing else on it - I've
seen huge scheduler delays (that affect the entire cluster) if an executor
runs on the same node.
(6) Regarding the GC issue, keep in mind that this is NOT an OOM but simply an
exceeded GC overhead limit, which indicates unnecessary object allocations,
etc. Please try setting -Xmn to 10% of the max heap; if this does not help,
please change the GC limit (see the configuration sketch below).
Apart from the configuration issues, it is certainly useful to also do another
pass over the matrix-to-frame converter to eliminate any remaining
inefficiencies.
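For concreteness, a minimal spark-defaults.conf sketch of points (1), (2), (4),
and (6). The memory size, local-dir paths, and -Xmn value are illustrative
assumptions for a 124 GB worker, not values taken from this thread:
{code}
# (1) no spark.default.parallelism entry - rely on the Spark defaults

# (2) one large executor per node: in standalone mode, leaving
#     spark.executor.cores unset gives all worker cores to a single executor
spark.executor.memory            105g

# (4) explicit local dirs for shuffle/spill files (paths are placeholders)
spark.local.dir                  /disk1/spark-local,/disk2/spark-local

# (6) young generation at ~10% of the max heap; if that does not help,
#     the GC overhead limit check can be relaxed or disabled
spark.executor.extraJavaOptions  -Xmn10g -XX:-UseGCOverheadLimit
{code}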
was (Author: mboehm7):
A couple of comments:
(1) Please do not specify the default parallelism - using the number of input
partitions, or the total number of cores for parallelize (both are the
defaults), usually works very well.
(2) In my experiments, the simple configuration of 1 executor per node with
large memory actually works best (it also has advantages in terms of resource
consumption because, for example, broadcasts can be shared in memory).
(3) Over-committing CPU is usually a good idea too, so there is no need to
spare a core for the OS.
(4) Since you are running standalone, you might want to explicitly specify
'spark.local.dir'.
(5) Make sure the driver runs on a single node with nothing else on it - I've
seen huge scheduler delays (that affect the entire cluster) if an executor
runs on the same node.
(6) I usually run both the driver and executors with the '-server' flag, which
has different JIT and GC configurations that really pay off for the
long-running executors.
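For reference, a minimal sketch of how the '-server' flag would typically be
passed via the standard Spark extra-Java-options properties (the exact
placement is an assumption; on most 64-bit server-class JVMs, -server is
already the default):
{code}
# hypothetical placement of the '-server' flag for driver and executors
spark.driver.extraJavaOptions    -server
spark.executor.extraJavaOptions  -server
{code}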
> GC OOM: Binary Matrix to Frame Conversion
> -----------------------------------------
>
> Key: SYSTEMML-994
> URL: https://issues.apache.org/jira/browse/SYSTEMML-994
> Project: SystemML
> Issue Type: Bug
> Reporter: Mike Dusenberry
> Priority: Blocker
>
> I currently have a SystemML matrix saved to HDFS in binary block format, and
> am attempting to read it in, convert it to a {{frame}}, and then pass that to
> an algorithm so that I can pull batches out of it with minimal overhead.
> When attempting to run this, I am repeatedly hitting the following GC limit:
> {code}
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> at org.apache.sysml.runtime.matrix.data.FrameBlock.ensureAllocatedColumns(FrameBlock.java:281)
> at org.apache.sysml.runtime.matrix.data.FrameBlock.copy(FrameBlock.java:979)
> at org.apache.sysml.runtime.matrix.data.FrameBlock.copy(FrameBlock.java:965)
> at org.apache.sysml.runtime.matrix.data.FrameBlock.<init>(FrameBlock.java:91)
> at org.apache.sysml.runtime.instructions.spark.utils.FrameRDDAggregateUtils$CreateBlockCombinerFunction.call(FrameRDDAggregateUtils.java:57)
> at org.apache.sysml.runtime.instructions.spark.utils.FrameRDDAggregateUtils$CreateBlockCombinerFunction.call(FrameRDDAggregateUtils.java:48)
> at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1015)
> at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:187)
> at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:186)
> at org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:148)
> at org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
> at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
> at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Script:
> {code}
> train = read("train")
> val = read("val")
> trainf = as.frame(train)
> valf = as.frame(val)
> # Rest of algorithm, which passes the frames to DML functions, and performs
> # row indexing to pull out batches, convert to matrices, and train.
> {code}
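> A hypothetical sketch of the batch-extraction pattern described above; the
> frame name 'trainf' is from the script, but the batch size, loop structure,
> and remaining variable names are illustrative assumptions, not the actual
> algorithm:
> {code}
> # hypothetical batch extraction from the converted frame; B and the loop
> # bounds are illustrative assumptions
> B = 32
> n = nrow(trainf)
> for (i in seq(1, n - B + 1, B)) {
>   batchf = trainf[i:(i + B - 1), ]   # right indexing pulls a batch of rows
>   Xbatch = as.matrix(batchf)         # convert the frame batch to a matrix
>   # ... train on Xbatch ...
> }
> {code}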
> Cluster setup:
> * Spark Standalone
> * 1 Master, 9 Workers
> * 47 cores, 124 GB available to Spark on each Worker (1 core + 1GB saved for OS)
> * spark.driver.memory 80g
> * spark.executor.memory 21g
> * spark.executor.cores 3
> * spark.default.parallelism 20000
> * spark.driver.maxResultSize 0
> * spark.akka.frameSize 128
> * spark.network.timeout 1000s
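> For reference, a sketch of how this configuration would typically be passed on
> a standalone cluster; the master URL, jar, and script names are placeholders,
> not taken from this issue:
> {code}
> # sketch only - all paths and host names are placeholders
> $SPARK_HOME/bin/spark-submit \
>   --master spark://master-host:7077 \
>   --driver-memory 80g \
>   --conf spark.executor.memory=21g \
>   --conf spark.executor.cores=3 \
>   --conf spark.default.parallelism=20000 \
>   --conf spark.driver.maxResultSize=0 \
>   --conf spark.akka.frameSize=128 \
>   --conf spark.network.timeout=1000s \
>   SystemML.jar -f script.dml
> {code}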
> Note: This is using today's latest build as of 09.29.16 1:30PM PST.
> cc [~mboehm7], [~acs_s]
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)