I am at a crossroads now, and expert advice will help me decide what the next
course of the project is going to be.

Background: At our company we process tons of data to help build an
experimentation platform. We fire more than 300 M/R jobs over petabytes of
data; the pipeline takes 24 hours and does lots of joins. It is simply
stupendously complex.

POC: Migrate a small portion of the processing to Spark and aim to achieve 10x
gains. Today this processing takes 2.5 to 3 hours in the M/R world.

Data Sources: 3 (all on HDFS).
Format: two in SequenceFile format and one in Avro.
Data size (loading sketch below):
1)  64 files  -   169,380,175,136 bytes (SequenceFile)
2) 101 files  -    84,957,259,664 bytes (Avro)
3) 744 files  - 1,972,781,123,924 bytes (SequenceFile)
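For reference, a minimal sketch of how these three inputs could be loaded in
Spark 1.3 (the HDFS paths, variable names, and Writable key/value types are
assumptions for illustration; the actual loading code is in the attached
VISummaryDataProvider.scala):

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.{BytesWritable, NullWritable, Text}
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("VISummaryPOC"))

    // SequenceFile sources #1 and #3 (key/value Writable types are assumptions)
    val ds1 = sc.sequenceFile("hdfs:///data/source1", classOf[Text], classOf[BytesWritable])
    val ds3 = sc.sequenceFile("hdfs:///data/source3", classOf[Text], classOf[BytesWritable])

    // Avro source #2, read through the new Hadoop API as GenericRecords
    val ds2 = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
        AvroKeyInputFormat[GenericRecord]]("hdfs:///data/source2")
      .map { case (avroKey, _) => avroKey.datum() }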

Process (rough Spark sketch below)
A) Map-side join of #1 and #2
B) Left outer join of A) with #3
C) Reduce by key of B)
D) Map-only processing of C)
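In Spark terms, a minimal sketch of these four steps might look like the
following (the keyed RDDs ds1ByKey/ds2ByKey/ds3ByKey and the helpers
mergeRecords/summarize are placeholders, not the actual code from the
attachment):

    // A) equi-join of #1 and #2 on a common key
    val stepA = ds1ByKey.join(ds2ByKey)
    // B) left outer join of A) with the large source #3
    val stepB = stepA.leftOuterJoin(ds3ByKey)
    // C) reduce by key, merging all values that share a key
    val stepC = stepB.reduceByKey(mergeRecords)
    // D) map-only post-processing of C)
    val stepD = stepC.map { case (key, value) => summarize(key, value) }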

Optimizations (sketched below)
1) Converted the equi-join in step A) to a map-side join using broadcast
variables.
2) Converted groupByKey + map to reduceByKey in step C).
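A minimal sketch of the two optimizations, assuming the #2-derived keyed data
set is the side small enough to broadcast (that choice and all names are
assumptions):

    // 1) Map-side join for step A): broadcast the smaller keyed data set to
    //    every executor so the join needs no shuffle.
    val smallSide = sc.broadcast(ds2ByKey.collectAsMap())
    val stepA = ds1ByKey.flatMap { case (key, v1) =>
      smallSide.value.get(key).map(v2 => (key, (v1, v2)))
    }

    // 2) Step C): reduceByKey instead of groupByKey + map, so values are
    //    combined map-side before the shuffle instead of shipping every
    //    record to a reducer.
    // Before: stepB.groupByKey().map { case (k, vs) => (k, vs.reduce(mergeRecords)) }
    val stepC = stepB.reduceByKey(mergeRecords)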

I have a huge YARN (Hadoop 2.4.x) cluster at my disposal, but I am limited to
using only 12 GB on each node.

1) My POC (after a month of crazy research and lots of Q&A on this amazing
forum) runs fine with 1 file from each of the above data sets and finishes in
10 minutes using 4 executors. I started at 60 minutes and got it down to 10
minutes.
2) For 5 files from each data set it takes 45 minutes and 16 executors.
3) When I run against 10 files, it fails repeatedly with OOM and several
timeout errors.
Configs (Spark 1.3.1): --num-executors 96 --driver-memory 12g
--driver-java-options "-XX:MaxPermSize=10G" --executor-memory 12g
--executor-cores 4 (the equivalent spark-submit invocation is sketched below)


Expert Advice
My goal is simple: to be able to complete the processing at 10x to 100x the
speed of M/R, or to show that it is not possible with Spark.

*A) 10x to 100x*
1) What will it take in terms of the number of executors, executor cores, and
memory per executor, plus whatever unknown magic settings I am supposed to
use, to reach this goal?
2) I am attaching the code for review; can it be sped up further, if that is
at all possible?
3) Do I need to do something else?

*B) Give up and wait for the next amazing tech to come along*
Given the steps I have performed so far, should I conclude that it is not
possible to achieve 10x to 100x gains and that I am stuck in the M/R world for
now?

I am in need of help here. I am available for discussion at any time (day or
night).

I hope I have provided all the details.
Regards,
Deepak

Attachment: VISummaryDataProvider.scala
