Hi,

Thank you so much for taking the time to answer my questions and pointing me
to the relevant documentation. I really appreciate it.

When a task failover happens, are there internal metrics in Flink at the job
level to track the new execution attempt? Is there a way for the application
owner to find out how many task failovers have occurred during a job
execution and to get the current execution attempt?
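To make the question concrete: if Flink exposes a job-level restart counter
through its REST API (I believe a `numRestarts` metric is available under
`/jobs/<job-id>/metrics`, though please correct me if that is wrong), I would
expect to parse the response roughly like this (the sample response below is
made up):

```python
import json

# Hypothetical sketch: parse the JSON returned by Flink's REST endpoint
# /jobs/<job-id>/metrics?get=numRestarts (the response body is invented here).
sample_response = json.dumps([{"id": "numRestarts", "value": "3"}])

def extract_num_restarts(body: str) -> int:
    """Return the numRestarts value from a job-metrics response body."""
    metrics = {m["id"]: m["value"] for m in json.loads(body)}
    return int(metrics["numRestarts"])

print(extract_num_restarts(sample_response))  # → 3
```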

Thanks.

On Mon, Mar 27, 2023 at 2:55 AM Weihua Hu <huweihua....@gmail.com> wrote:

> Hi,
>
> > 1. Does this mean that each task slot will contain an entire pipeline in
> > the job?
>
> Not exactly: each slot runs one subtask of each task. If the job is simple
> enough that there is no keyBy logic and no rebalance shuffle is enabled,
> each slot can run the entire pipeline. Otherwise, data must be shuffled to
> other subtasks.
> You can find some examples in [1].
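> To illustrate the distinction with a sketch (Python-style pseudocode, not a
> runnable program; `env` stands for a stream execution environment):
>
> ```
> stream  = env.from_collection([1, 2, 3])
> doubled = stream.map(lambda x: x * 2)       # chains with the source: same slot
> summed  = doubled.key_by(lambda x: x % 2)   # key_by introduces a hash shuffle
>                  .reduce(lambda a, b: a + b)
> ```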
>
> > 2. Upon a TM pod failure, and after K8s brings back the TM pod, would
> > Flink assign the same subtasks back to the restarted TM again? Or will
> > they be distributed to different TaskManagers?
>
> If there is no shuffled data in your job (as described in 1), only the
> tasks on the failed pod will be restarted, and they will be assigned to the
> new TM again. Otherwise, all of the related tasks will be restarted. When
> these tasks are re-scheduled, Flink applies slot-assignment strategies: it
> tries to assign each task to its previous slot to reduce recovery time, but
> there is no guarantee.
> You can read [2] for more information about failure recovery.
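> As a side note, the restart behavior itself is configurable. A minimal
> sketch of a fixed-delay restart strategy in flink-conf.yaml (the values are
> illustrative, and key names may differ across Flink versions):
>
> ```yaml
> restart-strategy: fixed-delay
> restart-strategy.fixed-delay.attempts: 3
> restart-strategy.fixed-delay.delay: 10 s
> ```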
>
>
> [1]
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/flink-architecture/#tasks-and-operator-chains
> [2]
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/
>
> Best,
> Weihua
>
>
> On Mon, Mar 27, 2023 at 3:22 PM santhosh venkat <
> santhoshvenkat1...@gmail.com> wrote:
>
> > Hi,
> >
> > I am trying to understand how subtask distribution works in Flink. Let's
> > assume a Flink cluster with a fixed number of TaskManagers in a
> > Kubernetes cluster.
> >
> > Let's say I have a Flink job in which all operators have the same
> > parallelism and the same slot sharing group. The operator parallelism is
> > computed as the number of TaskManagers multiplied by the number of task
> > slots per TM.
> >
> > 1. Does this mean that each task slot will contain an entire pipeline in
> > the job?
> > 2. Upon a TM pod failure, and after K8s brings back the TM pod, would
> > Flink assign the same subtasks back to the restarted TM again? Or will
> > they be distributed to different TaskManagers?
> >
> > It would be great if someone could answer these questions.
> >
> > Thanks.
> >
>
