[DISCUSS] Future of Per-Job Mode

Konstantin Knauf Thu, 13 Jan 2022 00:32:10 -0800

Hi everyone,

I would like to discuss and understand if the benefits of having Per-Job
Mode in Apache Flink outweigh its drawbacks.



*# Background: Flink's Deployment Modes*
Flink currently has three deployment modes. They differ in the following
dimensions:
* main() method executed on Jobmanager or Client
* dependencies shipped by client or bundled with all nodes
* number of jobs per cluster & relationship between job and cluster
lifecycle* (supported resource providers)

## Application Mode
* main() method executed on Jobmanager
* dependencies already need to be available on all nodes
* dedicated cluster for all jobs executed from the same main()-method
(Note: applications with more than one job, currently still significant
limitations like missing high-availability). Technically, a session cluster
dedicated to all jobs submitted from the same main() method.
* supported by standalone, native kubernetes, YARN

## Session Mode
* main() method executed in client
* dependencies are distributed from and by the client to all nodes
* cluster is shared by multiple jobs submitted from different clients,
independent lifecycle
* supported by standalone, Native Kubernetes, YARN

## Per-Job Mode
* main() method executed in client
* dependencies are distributed from and by the client to all nodes
* dedicated cluster for a single job
* supported by YARN only


*# Reasons to Keep** There are use cases where you might need the
combination of a single job per cluster, but main() method execution in the
client. This combination is only supported by per-job mode.
* It currently exists. Existing users will need to migrate to either
session or application mode.


*# Reasons to Drop** With Per-Job Mode and Application Mode we have two
modes that for most users probably do the same thing. Specifically, for
those users that don't care where the main() method is executed and want to
submit a single job per cluster. Having two ways to do the same thing is
confusing.
* Per-Job Mode is only supported by YARN anyway. If we keep it, we should
work towards support in Kubernetes and Standalone, too, to reduce special
casing.
* Dropping per-job mode would reduce complexity in the code and allow us to
dedicate more resources to the other two deployment modes.
* I believe with session mode and application mode we have to easily
distinguishable and understandable deployment modes that cover Flink's use
cases:
   * session mode: olap-style, interactive jobs/queries, short lived batch
jobs, very small jobs, traditional cluster-centric deployment mode (fits
the "Hadoop world")
   * application mode: long-running streaming jobs, large scale &
heterogenous jobs (resource isolation!), application-centric deployment
mode (fits the "Kubernetes world")


*# Call to Action*
* Do you use per-job mode? If so, why & would you be able to migrate to one
of the other methods?
* Am I missing any pros/cons?
* Are you in favor of dropping per-job mode midterm?

Cheers and thank you,

Konstantin

-- 

Konstantin Knauf

https://twitter.com/snntrable

https://github.com/knaufk

[DISCUSS] Future of Per-Job Mode

Reply via email to