We have been trying to solve a memory issue with a Spark job that processes
150GB of data (on disk). It does a groupBy operation; some of the executors
will receive somewhere around 2-4 million Scala case objects to work with
(a rough sketch of the job shape follows the config below). We are using
the following Spark config:

"executorInstances": "15",

     "executorCores": "1", (we reduce it to one so single task gets all the
executorMemory! at least that's the assumption here)

     "executorMemory": "15000m",

     "minPartitions": "2000",

     "taskCpus": "1",

     "executorMemoryOverhead": "1300",

     "shuffleManager": "tungsten-sort",

      "storageFraction": "0.4"


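For context, the shape of the job is roughly as follows (a minimal sketch
only; Event, the input path, and the per-group processing are made-up
stand-ins, not our actual code):

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative only: keyed map followed by groupByKey, so a task that owns a
    // hot key has to materialize all 2-4M case objects for that key at once.
    case class Event(userId: String, payload: String)

    object GroupByShape {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("GroupByShape"))

        val events = sc.textFile("hdfs:///data/events")   // ~150GB on disk (path is illustrative)
          .map { line => val cols = line.split('\t'); Event(cols(0), cols(1)) }

        val grouped = events
          .map(e => (e.userId, e))   // the "map at SparkDataJobs.scala:210" stage is a step like this
          .groupByKey(2000)          // minPartitions = 2000
          .mapValues(_.size)         // stand-in for the real per-group processing

        grouped.saveAsTextFile("hdfs:///data/out")
        sc.stop()
      }
    }
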
This is a snippet of what we see in the Spark UI for a job that fails.

This is a *stage* of this job that fails:

Stage Id: 5 (retry 15)
Pool Name: prod (http://hdn7:18080/history/application_1454975800192_0447/stages/pool?poolname=prod)
Description: map at SparkDataJobs.scala:210 (http://hdn7:18080/history/application_1454975800192_0447/stages/stage?id=5&attempt=15)
Submitted: 2016/02/09 21:30:06
Duration: 13 min
Tasks: Succeeded/Total: 130/389 (16 failed)
Shuffle Read: 1982.6 MB
Shuffle Write: 818.7 MB
Failure Reason: org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer{file=/tmp/hadoop/nm-local-dir/usercache/fasd/appcache/application_1454975800192_0447/blockmgr-abb77b52-9761-457a-b67d-42a15b975d76/0c/shuffle_0_39_0.data, offset=11421300, length=2353}

This is one of the *task* attempts from the above stage that threw an OOM:

Index: 2
Task ID: 22361
Attempt: 0
Status: FAILED
Locality Level: PROCESS_LOCAL
Executor ID / Host: 38 / nd1.mycom.local
Launch Time: 2016/02/09 22:10:42
Duration: 5.2 min
GC Time: 1.6 min
Shuffle Read Size / Records: 7.4 MB / 375509
Error: java.lang.OutOfMemoryError: Java heap space

java.lang.OutOfMemoryError: Java heap space
        at java.util.IdentityHashMap.resize(IdentityHashMap.java:469)
        at java.util.IdentityHashMap.put(IdentityHashMap.java:445)
        at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:159)
        at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:203)
        at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:202)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:202)
        at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:186)
        at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:54)
        at org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
        at org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
        at org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:3


None of the above suggests that it ran out of the 15GB of memory that I
initially allocated. So what am I missing here? What is eating my memory?

We tried executorJavaOpts to get a heap dump, but it doesn't seem to work:

-XX:-HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError='kill -3 %p'
-XX:HeapDumpPath=/opt/cores/spark
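
For reference, we set them like this (assuming our executorJavaOpts key ends
up in spark.executor.extraJavaOptions):

    import org.apache.spark.SparkConf

    // Assumed: executorJavaOpts is passed through as spark.executor.extraJavaOptions.
    // /opt/cores/spark has to exist and be writable on every NodeManager host for a
    // heap dump to land there; these are the exact flags we tried, unmodified.
    val confWithDumps = new SparkConf()
      .set("spark.executor.extraJavaOptions",
        "-XX:-HeapDumpOnOutOfMemoryError " +
        "-XX:OnOutOfMemoryError='kill -3 %p' " +
        "-XX:HeapDumpPath=/opt/cores/spark")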

I don't see any core files being generated, nor can I find a heap dump
anywhere in the logs.

Also, how do I find the YARN container ID from the Spark executor ID, so
that I can investigate the YARN NodeManager and ResourceManager logs for
that particular container?
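
The only approach I can think of so far is to log the executor log URLs when
executors register, since on YARN those URLs contain the container ID (a
sketch; I haven't verified there isn't a simpler way):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded}

    // Sketch: on YARN the executor log URLs contain the container ID, so logging
    // them as executors register gives an executorId -> container mapping.
    class ExecutorContainerLogger extends SparkListener {
      override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit = {
        println(s"executor ${e.executorId} on ${e.executorInfo.executorHost} " +
          s"logs: ${e.executorInfo.logUrlMap.values.mkString(", ")}")
      }
    }

    // registered right after creating the SparkContext in the driver:
    // sc.addSparkListener(new ExecutorContainerLogger)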

PS - The job does not do any caching of intermediate RDDs, as each RDD is
used only once by the subsequent step. We use Spark 1.5.2 on YARN in
yarn-client mode.


Thanks
