Hi Ozan,

From your description, it seems like your original huge job can be broken down into smaller disconnected graphs, with only some of them requiring checkpointing / snapshots. In general, it is good practice to split the disconnected subgraphs of the execution graph into separate jobs, so that checkpointing for each disconnected graph is coordinated independently.
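As a rough sketch of what that split looks like (the class name, source, sink, and checkpoint interval below are placeholders, not from your setup), each disconnected graph becomes its own small program with its own environment, so enabling checkpointing on one job does not involve the others:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class AlertsJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();
            // Checkpointing is configured per job; the jobs that don't
            // need snapshots simply skip this call.
            env.enableCheckpointing(60_000L);
            env.socketTextStream("localhost", 9999)      // placeholder source
               .filter(line -> line.contains("ALERT"))
               .print();                                 // placeholder sink
            env.execute("alerts-job");
        }
    }

Each such job is then submitted and checkpointed on its own, so updating one pipeline no longer requires snapshotting the whole graph.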
I don’t expect a problem with 100+ jobs in one cluster, but it is worth keeping in mind the resource usage for bookkeeping and coordination in the JobManager. With Flink's current process model, a single JobManager handles all jobs, so it is essentially a bottleneck to consider. There is ongoing work in FLIP-6 to improve Flink’s process model, one of the planned changes being a dedicated JobManager per job. If you’re interested, you can read about it here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077

Cheers,
Gordon

On February 10, 2017 at 4:40:18 PM, Ozan DENİZ (ozande...@outlook.com) wrote:

Hi everyone,

We have a huge execution graph for one streaming job. To update this execution graph, we take a snapshot of the job and restart the job from the snapshot. However, this can take too much time.

One option is splitting this huge streaming job into smaller ones. We could then cancel or run new streaming jobs (without taking a snapshot) instead of updating the huge job described above. However, we would end up with 100 - 150 small streaming jobs in one cluster.

My question is: is it good practice to run many streaming jobs (more than 100) in one cluster?

Best,
Ozan.
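For reference, the snapshot-and-restart workflow described above corresponds to the standard Flink CLI savepoint commands (job IDs, paths, and jar names below are placeholders; cancel-with-savepoint is only available in recent versions, Flink 1.2 onwards):

    # take a savepoint of the running job
    bin/flink savepoint <jobID>

    # or cancel the job and take a savepoint in one step
    bin/flink cancel -s [savepointDirectory] <jobID>

    # start the updated job from the savepoint
    bin/flink run -s <savepointPath> <jarFile> [arguments]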