Hi all,

I'm a beginner with Spark, and I'm hoping someone can offer guidance on the
following two questions.

Background: I have a data set growing by roughly 6 TB per year. I plan to
store the data in S3 and use Spark on EMR to read it all in, manipulate it,
and build a predictive model on it (say, a GBM). A rough sketch of the
pipeline I have in mind is below.
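
To make the question concrete, this is roughly what I imagine the job looking
like. The bucket path, the comma-separated line layout, the column names, and
the GBTRegressor settings are all just placeholder assumptions on my part:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, col
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import GBTRegressor

    spark = SparkSession.builder.appName("gbm-on-emr").getOrCreate()

    # Read all the text files straight out of S3 (EMR resolves s3:// paths)
    raw = spark.read.text("s3://my-bucket/data/")

    # Assume each line is "label,feature1,feature2" -- adjust to the real layout
    parts = split(col("value"), ",")
    df = raw.select(
        parts.getItem(0).cast("double").alias("label"),
        parts.getItem(1).cast("double").alias("f1"),
        parts.getItem(2).cast("double").alias("f2"),
    )

    # Assemble a feature vector and fit a gradient-boosted tree model (the "GBM")
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    gbt = GBTRegressor(labelCol="label", featuresCol="features", maxIter=50)
    model = gbt.fit(assembler.transform(df))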

1. Which option is best for storing the data in S3 for analysis with Spark on
EMR?
Option A: storing the 6 TB as 173 million individual text files
Option B: zipping those 173 million text files into 240,000 zip archives
Option C: concatenating the individual text files into 240,000 larger text
files per year
Option D: combining the text files into even fewer, larger files

2. Any recommendations on the EMR setup needed to analyse the 6 TB of data in
one go and build a GBM, in terms of:
1) the type of EC2 instances I would need,
2) the number of such instances, and
3) a rough estimate of the cost?


Thanks so much,
Zeming
