You might want to look at "Nephele: Efficient Parallel Data Processing in the Cloud" (Warneke & Kao, 2009):
http://stratosphere.eu/assets/papers/Nephele_09.pdf

This was some of the work done in the research project which gave birth to Flink, though this bit didn't surface, as they chose to leave VM allocation to others. Essentially: the query planner could track allocations and the lifespans of work; know, if a VM were to be released, to pick the one closest to its hour being up; let you choose between fast-but-expensive vs. slow-but-(maybe)-less-expensive execution; etc. It's a complex problem, as to do it you need to think about more than just spot load -- more "how to efficiently divide work amongst a pool of machines with different lifespans".

What could be good to look at today, rather than hard-coding the logic:

- provide metrics which higher-level tools could use to make decisions and send hints down;
- maybe schedule things to best support preemptible nodes in the cluster -- the EC2 instances you bid spot prices on, which are guaranteed for one hour and can then be killed without warning.

Preemption-aware scheduling might imply making sure that any critical information is kept off the preemptible nodes, or at least replicated onto a long-lived one, and having something in the controller ready to react to unannounced preemption. FWIW, when YARN preempts you, you do get notified, and maybe even some early warning; I don't know if Spark uses that. There is some support in HDFS for declaring that some nodes have interdependent failures ("failure domains"), so you could use that to have HDFS handle replication and store only one copy on preemptible VMs, leaving only the scheduling and recovery problem.

Finally, YARN container resizing lets you ask for more resources when busy and release them when idle.
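To illustrate the "pick the VM closest to its hour being up" idea: with per-hour billing (as with classic EC2 spot instances), the remainder of the current hour is already paid for, so when shrinking the pool you release the instance nearest the end of its billed hour. This is a minimal hypothetical sketch -- the function names and the dict-of-launch-times representation are my own, not any scheduler's API:

```python
import time

HOUR = 3600  # classic EC2 billing granularity, in seconds

def seconds_into_billing_hour(launch_time, now):
    """Seconds elapsed within the VM's current (already-paid) billing hour."""
    return (now - launch_time) % HOUR

def pick_vm_to_release(vms, now=None):
    """vms: dict of vm_id -> launch timestamp (seconds).
    Returns the id of the VM whose current billing hour is closest
    to ending, i.e. the cheapest one to give up right now."""
    now = time.time() if now is None else now
    return max(vms, key=lambda v: seconds_into_billing_hour(vms[v], now))

# Three VMs launched at t=0, t=1200, t=3000; at t=3500, vm-a is 3500s
# into its paid hour, vm-b 2300s, vm-c 500s, so vm-a goes first.
vms = {"vm-a": 0, "vm-b": 1200, "vm-c": 3000}
print(pick_vm_to_release(vms, now=3500))  # -> vm-a
```

A real planner would of course weigh this against data locality and the cost of re-running work placed on the released node.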
Container resizing may be good for CPU load, though handing memory back isn't something programs can generally do.

On 2 Feb 2017, at 19:05, Gabi Cristache <gabi.crista...@gmail.com> wrote:

Hello,

My name is Gabriel Cristache and I am a student in my final year of a Computer Engineering/Science university. For my Bachelor's thesis I want to add support for dynamic scaling to a Spark Streaming application. The goal of the project is to develop an algorithm that automatically scales the cluster up and down based on the volume of data processed by the application. You will need to balance between quick reaction to traffic spikes (scale up) and avoiding wasted resources (scale down), by implementing something along the lines of a PID algorithm.

Do you think this is feasible? And if so, are there any hints you could give me that would help my objective?

Thanks,
Gabriel Cristache
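The PID-style scaling mentioned above could be sketched roughly as follows. Everything here is an illustrative assumption, not a Spark API: the class, the gains, and the choice of error signal (relative backlog, i.e. how far the arrival rate outruns the processing rate) are all placeholders a thesis would tune and justify:

```python
class BacklogPID:
    """Toy PID controller over a streaming backlog signal.

    error > 0 means records arrive faster than they are processed
    (falling behind -> scale up); error < 0 means spare capacity
    (-> scale down). Gains kp/ki/kd are illustrative, not tuned.
    """

    def __init__(self, kp=0.5, ki=0.1, kd=0.2, min_execs=1, max_execs=50):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.min_execs, self.max_execs = min_execs, max_execs
        self.integral = 0.0
        self.prev_error = 0.0

    def desired_executors(self, current, arrival_rate, processing_rate):
        # Relative backlog growth as the error term.
        error = (arrival_rate - processing_rate) / max(processing_rate, 1.0)
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        adjust = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Scale the current allocation and clamp to cluster limits.
        target = round(current * (1.0 + adjust))
        return max(self.min_execs, min(self.max_execs, target))

pid = BacklogPID()
# Traffic spike: arrivals at twice the processing rate -> scale up.
print(pid.desired_executors(current=4, arrival_rate=2000, processing_rate=1000))  # -> 7
```

The clamping plus the derivative term is what gives you the spike-vs-waste balance: the P term reacts to the current backlog, the I term cleans up sustained lag, and the D term damps oscillation so the cluster doesn't thrash between sizes.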