Andrey Pankov wrote:
It's a little bit expensive to keep a big cluster running for a long period, especially if you use EC2. So, as a possible solution, we can start additional nodes and add them to the cluster before running a job, and then, after it finishes, kill the unused nodes.
As Ted has indicated, that should work. It won't be as fast as if you keep the entire cluster running the whole time, but it will be much cheaper.
An alternative is to store your persistent data in S3. Then you can shut down your cluster altogether when you're not computing. Your startup time each day will be slower, since reading from S3 is slower than reading from HDFS, so this may or may not be practical for you.
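To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it is a made-up placeholder (the price per node-hour, the node counts, the job length, and the extra hour assumed for loading data from S3), not a real EC2 price, and S3 storage and transfer charges are ignored entirely. It only illustrates how the three setups compare in node-hours.

```python
# Back-of-the-envelope daily cost comparison.
# All figures are hypothetical placeholders, not real EC2 prices.

def daily_cost_cents(core_nodes, extra_nodes, job_hours,
                     cents_per_node_hour, core_hours=24):
    """Cost of one day in cents.

    Core nodes run for core_hours; extra nodes run only while the job runs.
    """
    core = core_nodes * core_hours * cents_per_node_hour
    extra = extra_nodes * job_hours * cents_per_node_hour
    return core + extra

PRICE = 10         # hypothetical: cents per node-hour
JOB_HOURS = 4      # hypothetical: length of the daily job
S3_LOAD_HOURS = 1  # hypothetical: extra time spent reading input from S3

# 1. Keep all 20 nodes running around the clock.
always_on = daily_cost_cents(20, 0, JOB_HOURS, PRICE)

# 2. Keep 4 nodes up to hold HDFS data, add 16 more only for the job.
elastic = daily_cost_cents(4, 16, JOB_HOURS, PRICE)

# 3. Keep data in S3, start 20 nodes for the job, then shut everything down.
s3_backed = daily_cost_cents(0, 20, JOB_HOURS + S3_LOAD_HOURS, PRICE)

print("always on:", always_on, "cents/day")
print("elastic:  ", elastic, "cents/day")
print("S3-backed:", s3_backed, "cents/day")
```

With these placeholder numbers the S3-backed setup comes out cheapest and the always-on cluster most expensive, but the ordering depends on how long the S3 load really takes and on what you pay for S3 itself, so plug in your own figures.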
Doug
