I've got a similar question. Would you be able to provide a rough guide
(even a range is fine) on the number of nodes, cores, and total amount of
RAM required?

Do you want to store 1 TB, 1 PB or far more?

- say 6 TB of data in parquet format on s3


Do you want to just read that data, retrieve it and do a little work on it,
or do you have a complex machine learning pipeline?

- I need to 1) read it and do complex machine learning on it, and 2) query
the last 3 months of data, visualise it, and come back with answers within
seconds of latency
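
To make that concrete, here is a minimal PySpark sketch of the two access
patterns; the bucket path s3a://my-bucket/events/ and the partition column
event_date are hypothetical placeholders, not actual values from my setup:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-s3-workload").getOrCreate()

# The full 6 TB scan feeds the machine learning pipeline (batch, throughput-bound).
events = spark.read.parquet("s3a://my-bucket/events/")

# The interactive part only touches the last 3 months; filtering on the
# (assumed) date partition column lets Spark prune most of the files.
recent = events.where(F.col("event_date") >= F.add_months(F.current_date(), -3))

# Cache the recent slice so repeated dashboard-style queries answer in seconds
# instead of re-reading S3 each time.
recent.cache()
recent.groupBy("event_date").count().show()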




On Sun, Apr 30, 2017 at 6:57 PM, yohann jardin <yohannjar...@hotmail.com>
wrote:

> It really depends on your needs and your data.
>
>
> Do you want to store 1 TB, 1 PB or far more? Do you want to just read that
> data, retrieve it and do a little work on it, or do you have a complex
> machine learning pipeline? Depending on the workload, the ratio between
> cores and storage will vary.
>
>
> First start with a subset of your data and do some tests on your own
> computer or (better) with a little cluster of 3 nodes. This will help you
> find your ratio between storage and cores, and the amount of memory you
> should expect to need once you move from that subset to the whole dataset
> you have available.
>
>
> Then, using this information and the guidance on the Spark website (
> http://spark.apache.org/docs/latest/hardware-provisioning.html), you will
> be able to specify the hardware of one node, and how many nodes you need
> (at least 3).
>
>
> *Yohann Jardin*
> On 4/30/2017 at 10:26 AM, rakesh sharma wrote:
>
> Hi
>
> I would like to know the details of implementing a cluster.
>
> What kind of machines would one require, how many nodes, how many cores,
> etc.?
>
>
> thanks
>
> rakesh
>
>
>
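
To put rough numbers on Yohann's advice above: once the subset tests suggest
a cores/memory ratio, the per-executor resources can be tried out directly
when building a PySpark session. The figures below (3 worker nodes with 16
cores and 64 GB each, carved into 5-core executors) are purely hypothetical,
for illustration only:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-sizing-trial")
    .config("spark.executor.cores", "5")       # cores per executor
    .config("spark.executor.memory", "18g")    # leaves headroom for OS and overhead
    .config("spark.executor.instances", "9")   # 3 executors per node x 3 nodes (YARN-style setting)
    .config("spark.driver.memory", "8g")
    .getOrCreate()
)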
