Hi Ozan,

From your description, it sounds like your original huge job can be broken down 
into smaller, disconnected graphs, with only some of them requiring 
checkpointing / snapshots.
In general, it is good practice to split the disconnected subgraphs of an 
execution graph into separate jobs, so that checkpointing for each subgraph is 
coordinated independently.
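
To make the idea concrete, here is a minimal sketch (the source, transformation, 
and sink below are placeholders, not your actual pipeline): each disconnected 
piece becomes its own program with its own execute() call, and only the jobs 
that actually need snapshots enable checkpointing.

  import org.apache.flink.api.common.functions.MapFunction;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

  public class JobA {
      public static void main(String[] args) throws Exception {
          StreamExecutionEnvironment env =
                  StreamExecutionEnvironment.getExecutionEnvironment();

          // Only this pipeline needs fault tolerance, so only this job
          // enables checkpointing; its checkpoints are coordinated
          // independently of every other job in the cluster.
          env.enableCheckpointing(60_000L); // checkpoint every 60 seconds

          env.fromElements("a", "b", "c")              // placeholder source
             .map(new MapFunction<String, String>() {  // placeholder logic
                 @Override
                 public String map(String value) {
                     return value.toUpperCase();
                 }
             })
             .print();                                 // placeholder sink

          // One job per disconnected graph: redeploying or updating JobA
          // no longer touches the snapshots of any other pipeline.
          env.execute("job-a");
      }
  }

A second pipeline that doesn't need snapshots would simply be its own JobB class 
without the enableCheckpointing() call, submitted as a separate job.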

I don’t expect a problem with 100+ jobs in one cluster, but it is worth keeping 
in mind the resource usage for bookkeeping and coordination in the JobManager. 
With Flink's current process model, a single JobManager handles all jobs, so it 
is essentially a bottleneck to consider. There is ongoing work under FLIP-6 to 
improve Flink’s process model; one of its goals is to move to a dedicated 
JobManager per job. If you’re interested, you can read more here: 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077

Cheers,
Gordon


On February 10, 2017 at 4:40:18 PM, Ozan DENİZ (ozande...@outlook.com) wrote:

Hi everyone,  


We have a huge execution graph for one streaming job. To update this execution 
graph, we take a snapshot of the job and restart the job from that snapshot. 
However, this can take too much time.


One option is splitting this huge streaming job into smaller ones. We can then 
cancel or start individual streaming jobs (without taking a snapshot) instead 
of updating the huge one described above. However, we will end up with 100 - 150 
small streaming jobs in one cluster.


My question is:

Is it good practice to run a large number of streaming jobs (more than 100) in 
one cluster?


Best,  


Ozan.  
