I'm really not an expert here, but try the following ideas:
1) I assume you are using YARN; this blog post is very good on resource 
tuning: 
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
2) If 12G is a hard limit in this case, then you have no option but to lower 
your concurrency. As a first step, try setting "--executor-cores=1"; this 
forces each executor to run one task at a time. It is the least efficient 
setting for your job, but see whether your application can finish without OOM 
(a combined sketch of items 2-4 follows after this list).
3) Add more partitions to your RDD. For a given RDD, more partitions means 
each partition contains less data, which requires less memory to process; and 
since each partition is processed by one core in each executor, this brings 
the memory requirement per executor down to almost its lowest level.
4) Do you cache data? Don't cache it for now, and lower 
"spark.storage.memoryFraction", so less memory is reserved for the cache.
Since your top priority is to avoid OOM, all of the above steps will make the 
job run slower or less efficiently. In any case, you should first check your 
code logic to see whether there is any room for improvement, but per your 
email we assume your code is already optimized. If the above steps still don't 
cure the OOM, then maybe the data for one partition just cannot fit in a 12G 
heap, given the logic you are trying to run in your code.
Yong
From: deepuj...@gmail.com
Date: Thu, 30 Apr 2015 18:48:12 +0530
Subject: Expert advise needed. (POC is at crossroads)
To: user@spark.apache.org

I am at a crossroads now, and expert advice would help me decide what the next 
course of the project is going to be.
Background: At our company we process tons of data to help build an 
experimentation platform. We fire more than 300 M/R jobs over petabytes of 
data; the pipeline takes 24 hours and does lots of joins. It is simply 
stupendously complex.
POC: Migrate a small portion of the processing to Spark, aiming for 10x gains. 
Today this processing in the M/R world takes 2.5 to 3 hours.
Data Sources: 3 (all on HDFS). Format: two in SequenceFile and one in Avro.
Data Size:
1)  64 files    169,380,175,136 bytes - Sequence
2) 101 files     84,957,259,664 bytes - Avro
3) 744 files  1,972,781,123,924 bytes - Sequence
Process:
A) Map-side join of #1 and #2
B) Left outer join of A) and #3
C) reduceByKey of B)
D) Map-only processing of C)
Optimizations:
1) Converted the equi-join to a map-side (broadcast variables) join in #A.
2) Converted groupBy + map => reduceByKey in #C.
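For reference, a minimal sketch of the pipeline shape described above, in 
Spark 1.3 Scala. All names, value types, and combine functions are 
hypothetical placeholders; the real logic is in the attached code:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def pipeline(sc: SparkContext,
                 seq1: RDD[(String, String)],   // #1 (SequenceFile)
                 avro2: RDD[(String, String)],  // #2 (Avro), small enough to broadcast
                 seq3: RDD[(String, String)])   // #3 (SequenceFile)
        : RDD[(String, String)] = {
      // A) Map-side join: ship #2 to every executor as a broadcast variable,
      //    so the join with #1 needs no shuffle.
      val small = sc.broadcast(avro2.collectAsMap())
      val a = seq1.flatMap { case (k, v1) =>
        small.value.get(k).map(v2 => (k, (v1, v2)))
      }
      // B) Left outer join of A) with #3 (this step does shuffle).
      val b = a.leftOuterJoin(seq3)
      // C) reduceByKey of B): combines map-side, unlike groupBy + map.
      val c = b.mapValues { case ((v1, v2), v3) => v1 + "|" + v2 + "|" + v3.getOrElse("") }
               .reduceByKey(_ + "," + _)        // placeholder combine function
      // D) Map-only processing of C).
      c.mapValues(_.trim)                       // placeholder transformation
    }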
I have a huge YARN (Hadoop 2.4.x) cluster at my disposal, but I am limited to 
using only 12G on each node.
1) My POC (after a month of crazy research and lots of Q&A on this amazing 
forum) runs fine with 1 file from each of the above data sets and finishes in 
10 mins with 4 executors. I started at 60 mins and got it down to 10 mins.
2) With 5 files from each data set, it takes 45 mins and 16 executors.
3) When I run against 10 files, it fails repeatedly with OOM and several 
timeout errors.
Configs: --num-executors 96 --driver-memory 12g --driver-java-options 
"-XX:MaxPermSize=10G" --executor-memory 12g --executor-cores 4, Spark 1.3.1
Expert Advice
My goal is simple: complete the processing at 10x to 100x the speed of M/R, or 
show that it is not possible with Spark.
A) 10x to 100x
1) What will it take in terms of # of executors, # of executor cores, amount 
of memory on each executor, and any unknown magic settings I am supposed to 
apply to reach this goal?
2) I am attaching the code for review; can the processing be sped up further, 
if that is at all possible?
3) Do I need to do something else?
B) Give up and wait for the next amazing tech to come up
Given the steps I have performed so far, should I conclude that it is not 
possible to achieve 10x to 100x gains, and that I am stuck in the M/R world 
for now?
I am in need of help here. I am available for discussion at any time 
(day/night).
Hope I provided all the details.
Regards,
Deepak