I've got a similar question. Would you be able to provide a rough guide (even a range is fine) on the number of nodes, cores, and total amount of RAM required?
Do you want to store 1 TB, 1 PB or far more? - say 6 TB of data in parquet format on S3.

Do you want to just read that data, retrieve it and do little work on it, or do you have a complex machine learning pipeline? - I need to 1) read it and do complex machine learning, and 2) query the last 3 months of data, visualise it, and come back with answers within seconds.

On Sun, Apr 30, 2017 at 6:57 PM, yohann jardin <yohannjar...@hotmail.com> wrote:
> It really depends on your needs and your data.
>
> Do you want to store 1 TB, 1 PB or far more? Do you want to just read that
> data, retrieve it and do little work on it, or do you have a complex
> machine learning pipeline? Depending on the workload, the ratio between
> cores and storage will vary.
>
> First start with a subset of your data and do some tests on your own
> computer or (better) with a little cluster of 3 nodes. This will help you
> find the ratio between storage and cores, and the amount of memory you can
> expect to need when you move from a subset of your data to the whole of it.
>
> Then, using this information and the indications on the Spark website
> (http://spark.apache.org/docs/latest/hardware-provisioning.html), you will
> be able to specify the hardware of one node and how many nodes you need
> (at least 3).
>
> Yohann Jardin
>
> On 4/30/2017 at 10:26 AM, rakesh sharma wrote:
> > Hi
> > I would like to know the details of implementing a cluster.
> > What kind of machines one would require, how many nodes, number of cores,
> > etc.
> > Thanks
> > Rakesh
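For context, here is a minimal sketch of the second part of that workload: reading the parquet data from S3, keeping only the last 3 months and caching it so interactive queries come back quickly. The bucket name "my-bucket" and the timestamp column "event_time" are made-up placeholders, not something from the thread:

  // Minimal sketch, assuming Spark 2.x with S3A configured.
  // "my-bucket" and "event_time" are hypothetical names.
  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._

  val spark = SparkSession.builder()
    .appName("sizing-test")
    .getOrCreate()

  // Read the full parquet dataset from S3 (~6 TB in the scenario above).
  val events = spark.read.parquet("s3a://my-bucket/events/")

  // Keep only the last 3 months and cache the slice so repeated
  // interactive queries hit memory instead of re-reading S3.
  val recent = events
    .filter(col("event_time") >= add_months(current_date(), -3))
    .cache()

  // The first query materializes the cache; later queries are fast.
  recent.groupBy(window(col("event_time"), "1 day")).count().show()

Running something like this on a small test cluster over a subset of the data, then checking the Storage tab in the Spark UI, gives you a rough idea of how much executor memory the cached 3-month slice needs, which you can then scale up and feed into the hardware-provisioning estimates from the page Yohann linked.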