I am at a crossroads, and expert advice would help me decide what the next course of the project is going to be.
Background: At our company we process tons of data to help build an experimentation platform. We run more than 300 M/R jobs over petabytes of data; the whole pipeline takes 24 hours and does lots of joins. It is simply stupendously complex.

POC: Migrate a small portion of the processing to Spark and aim to achieve 10x gains. Today this processing takes 2.5 to 3 hours in the M/R world.

Data Sources: 3 (all on HDFS). Format: two in Sequence File and one in Avro.

Data Size:
1) 64 files, 169,380,175,136 bytes (about 169 GB) - Sequence
2) 101 files, 84,957,259,664 bytes (about 85 GB) - Avro
3) 744 files, 1,972,781,123,924 bytes (about 1.97 TB) - Sequence

Process:
A) Map-side join of #1 and #2
B) Left outer join of A) and #3
C) Reduce by key of B)
D) Map-only processing of C)

Optimizations:
1) Converted the equi-join in step A) to a map-side join (broadcast variables).
2) Converted groupBy + map to reduceByKey in step C).

I have a huge YARN (Hadoop 2.4.x) cluster at my disposal, but I am limited to using only 12G on each node.

Results so far:
1) My POC (after a month of crazy research and lots of Q&A on this amazing forum) runs fine with 1 file from each of the above data sets and finishes in 10 mins using 4 executors. I started at 60 mins and got it down to 10 mins.
2) With 5 files from each data set it takes 45 mins and 16 executors.
3) When I run against 10 files, it fails repeatedly with OOM and several timeout errors.

Configs: --num-executors 96 --driver-memory 12g --driver-java-options "-XX:MaxPermSize=10G" --executor-memory 12g --executor-cores 4, Spark 1.3.1

Expert Advice

My goal is simple: complete the processing 10x to 100x faster than M/R, or show that this is not possible with Spark.

*A) 10x to 100x*
1) What will it take in terms of number of executors, number of executor cores, amount of memory on each executor, and any unknown magic settings that I am supposed to use to reach this goal?
2) I am attaching the code for review. Can the processing be sped up further, if that is possible at all?
3) Do I need to do something else?
*B) Give up and wait for the next amazing tech to come up*
Given the steps that I have performed so far, should I conclude that it is not possible to achieve 10x to 100x gains and that I am stuck with the M/R world for now?

I am in need of help here. I am available for discussion at any time (day/night). I hope I have provided all the details.

Regards,
Deepak
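For readers without access to the attachment, steps A) to D) above can be sketched roughly as follows. This is a minimal, hypothetical sketch and not the attached code: the object name, the tiny in-memory data sets, the Long keys, and the summing reduce function are all assumptions made for illustration, and it runs in local mode rather than on YARN. The broadcast step only works when the smaller side of the join fits in driver and executor memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch of steps A) to D); all names and data are illustrative.
object JoinPipelineSketch {

  def run(sc: SparkContext): Array[(Long, String)] = {
    // Stand-ins for the three HDFS data sets, keyed by an assumed Long id.
    val ds1 = sc.parallelize(Seq((1L, "a1"), (2L, "a2"), (3L, "a3"))) // #1
    val ds2 = sc.parallelize(Seq((1L, "b1"), (2L, "b2")))             // #2
    val ds3 = sc.parallelize(Seq((1L, 10L), (1L, 5L), (3L, 7L)))      // #3

    // A) Map-side equi-join: collect #2 to the driver, broadcast it,
    //    and look each key of #1 up locally (no shuffle for this join).
    val small = sc.broadcast(ds2.collectAsMap())
    val joinedA = ds1.flatMap { case (k, v1) =>
      small.value.get(k).map(v2 => (k, (v1, v2))).iterator
    }

    // B) Left outer join of A) with #3; unmatched keys get None.
    val joinedB = joinedA.leftOuterJoin(ds3)

    // C) reduceByKey instead of groupBy + map, so values are combined
    //    on the map side before the shuffle. Summing is an assumption.
    val reduced = joinedB
      .mapValues { case (_, maybeCount) => maybeCount.getOrElse(0L) }
      .reduceByKey(_ + _)

    // D) Map-only processing of C).
    reduced.mapValues(total => s"total=$total").collect().sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("join-pipeline-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try run(sc).foreach(println)
    finally sc.stop()
  }
}
```

In the real job the three inputs would be read from HDFS (Sequence and Avro files) instead of `parallelize`, and the value types and reduce function would be whatever the attached code uses.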
Attachment: VISummaryDataProvider.scala