Re: Apache Spark Contribution

2017-02-03 Thread Steve Loughran
You might want to look at Nephele: Efficient Parallel Data Processing in the 
Cloud, Warneke & Kao, 2009

http://stratosphere.eu/assets/papers/Nephele_09.pdf

This was some of the work done in the research project which gave birth to 
Flink, though this bit didn't surface, as they chose to leave VM allocation to 
others.

Essentially: the query planner could track allocations and lifespans of work; 
know that, if a VM were to be released, it should pick the one closest to its 
billed hour being up; let you choose between fast-but-expensive and 
slow-but-(maybe)-less-expensive execution; etc.
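That release-the-right-VM idea can be sketched as a tiny policy. This is an illustration only, assuming hourly billing; none of the names come from Spark or Nephele:

```python
from dataclasses import dataclass

@dataclass
class Vm:
    name: str
    uptime_minutes: int  # minutes since this VM was started

def minutes_until_hour_boundary(vm: Vm) -> int:
    # With hourly billing, releasing just before the next hour boundary
    # wastes the least already-paid-for capacity.
    return 60 - (vm.uptime_minutes % 60)

def pick_vm_to_release(idle_vms: list) -> Vm:
    # Prefer the idle VM closest to the end of its current billed hour.
    return min(idle_vms, key=minutes_until_hour_boundary)
```

A scheduler would call `pick_vm_to_release` only over VMs with no remaining work, since releasing a busy VM forces recomputation.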

It's a complex problem: to do it you need to think about more than just spot 
load, more "how do I efficiently divide work amongst a pool of machines with 
different lifespans?"

What could be good to look at today, rather than hard-coding the logic, would 
be to:

- provide metrics which higher-level tools could use to make 
decisions/send hints down
- maybe schedule things to best support pre-emptible nodes in the cluster: the 
ones from EC2 where you bid spot prices, get one hour guaranteed, and after 
that can be killed without warning.
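The first suggestion, exposing metrics for an external autoscaler to turn into hints, might look something like this; the metric names and thresholds are made up purely for illustration:

```python
# Hypothetical metrics a scheduler could publish; an external tool reads
# them and sends a hint back, rather than the engine hard-coding the logic.
def hint_from_metrics(m: dict) -> dict:
    # How much queued work each executor would have to absorb.
    backlog_per_executor = m["pending_tasks"] / max(m["total_executors"], 1)
    if backlog_per_executor > 10:
        return {"action": "scale_up", "by": 2}
    if m["busy_executors"] < m["total_executors"] // 2:
        return {"action": "scale_down", "by": 1}
    return {"action": "hold"}
```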

Preemption-aware scheduling might imply making sure that any critical 
information is kept off the preemptible nodes, or at least replicated onto a 
long-lived one, and having something in the controller ready to react to 
unannounced pre-emption. FWIW, when YARN preempts you do get notified, and 
maybe even some very early warning; I don't know if Spark uses that.
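Both halves of that idea can be sketched together: keep critical state off preemptible nodes, and drain a node when a preemption notice arrives. The types here are illustrative, not a real Spark or YARN API:

```python
class Node:
    def __init__(self, name: str, preemptible: bool):
        self.name = name
        self.preemptible = preemptible

def placement_candidates(nodes, critical: bool):
    # Critical state (checkpoints, shuffle output) goes only to
    # long-lived nodes; ordinary tasks may run anywhere.
    return [n for n in nodes if not n.preemptible] if critical else list(nodes)

def on_preemption_notice(node, running_tasks, reschedule):
    # If the resource manager gives early warning, use it to move
    # this node's tasks before the node disappears.
    for task in running_tasks.pop(node.name, []):
        reschedule(task)
```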

There is some support in HDFS for declaring that some nodes have interdependent 
failures ("failure domains"), so you could use that to have HDFS handle 
replication and store only one copy on preemptible VMs, leaving just the 
scheduling and recovery problem.
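The placement invariant can be illustrated like this. This is not HDFS's actual BlockPlacementPolicy API, just a sketch of the rule: at most one replica of any block lands on preemptible nodes.

```python
def place_replicas(nodes, replication=3, max_preemptible=1):
    # nodes: list of (name, is_preemptible) pairs.
    # Fill from long-lived nodes first, then allow at most
    # `max_preemptible` replicas on short-lived ones.
    long_lived = [name for name, p in nodes if not p]
    preemptible = [name for name, p in nodes if p]
    chosen = long_lived[:replication]
    shortfall = replication - len(chosen)
    if shortfall > 0:
        chosen += preemptible[:min(max_preemptible, shortfall)]
    return chosen
```

Losing every preemptible node then costs at most one replica per block, which normal re-replication can repair.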

Finally, YARN container resizing lets you ask for more resources when busy and 
release them when idle. This may be good for CPU load, though handing memory 
back isn't something programs can generally do.
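The grow-when-busy, shrink-when-idle decision could be as simple as the following; the thresholds are placeholders, and the actual YARN container-update calls are not shown:

```python
def desired_vcores(current: int, cpu_utilization: float,
                   low: float = 0.3, high: float = 0.85) -> int:
    # Grow by one core under sustained high load, shrink under low
    # load, never dropping below a single core.
    if cpu_utilization > high:
        return current + 1
    if cpu_utilization < low and current > 1:
        return current - 1
    return current
```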

On 2 Feb 2017, at 19:05, Gabi Cristache wrote:

Hello,

My name is Gabriel Cristache and I am a final-year Computer Engineering/Science 
student. For my Bachelor's Thesis, I want to add support for dynamic scaling to 
a Spark Streaming application.

The goal of the project is to develop an algorithm that automatically scales 
the cluster up and down based on the volume of data processed by the 
application.
It will need to balance quick reaction to traffic spikes (scale up) against 
avoiding wasted resources (scale down), by implementing something along the 
lines of a PID algorithm.


Do you think this is feasible? And if so, are there any hints you could give me 
that would help?

Thanks,
Gabriel Cristache
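The PID-style balancing described in the quoted message could be sketched as follows; the gains are arbitrary placeholders, and the inputs are assumed to be record rates per batch interval:

```python
class PidScaler:
    """Toy PID controller for executor-count adjustments: a positive
    output suggests adding executors, a negative one suggests removing
    some. The gains here are placeholders, not tuned values."""

    def __init__(self, kp=0.5, ki=0.1, kd=0.2):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, incoming_rate: float, processing_rate: float) -> float:
        error = incoming_rate - processing_rate  # backlog growth this interval
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

The derivative term damps overreaction to a single spike, while the integral term catches a slowly accumulating backlog that no single interval makes obvious.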



Apache Spark Contribution

2017-02-02 Thread Gabi Cristache
Hello,

My name is Gabriel Cristache and I am a final-year Computer Engineering/Science
student. For my Bachelor's Thesis, I want to add support for dynamic scaling
to a Spark Streaming application.

The goal of the project is to develop an algorithm that automatically
scales the cluster up and down based on the volume of data processed by the
application.

It will need to balance quick reaction to traffic spikes (scale up)
against avoiding wasted resources (scale down), by implementing something
along the lines of a PID algorithm.



Do you think this is feasible? And if so, are there any hints you
could give me that would help?


Thanks,

Gabriel Cristache