Hi, Bekir

First, the end-to-end (e2e) time for a subtask is $ack_time_received_in_JM -
$trigger_time_in_JM. A checkpoint includes several steps on the task side:
1) receive the first barrier; 2) barrier alignment (for exactly-once); 3)
operator snapshot sync part; 4) operator snapshot async part. The images
you shared yesterday showed that the sync part took too long; now both the
sync part and the async part take a long time, and the e2e time is much
longer than sync_time + async_time.
1. You can check whether your job has backpressure problems (backpressure
may make the barriers flow too slowly to the downstream tasks). If it does,
you should solve that first.
2. If there is no backpressure problem, you can check the `Alignment
Duration` to see whether barrier alignment took too long.
3. For the sync part, you can check the disk performance (if you do not
have such a metric, you can look at the `sar` log on your machine).
4. For the async part, you can check the network performance (or any
client-side network flow control).
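
To make the gap concrete, here is a minimal Python sketch using the
subtask 14 numbers from your mail (sync 4 min, async 3 min, e2e 53 min).
The class and field names are made up for illustration and are not Flink
API; the code only does the subtraction. The remainder is the time not
covered by the sync and async parts, which is where waiting for the first
barrier and barrier alignment (steps 1 and 2 above) would fall.

```python
from dataclasses import dataclass


@dataclass
class SubtaskCheckpoint:
    """Hypothetical holder for durations read off the checkpoint UI (minutes)."""
    end_to_end_min: int
    sync_min: int
    async_min: int

    def unaccounted_min(self) -> int:
        # Time in the e2e duration not explained by the snapshot itself
        # (sync part + async part).
        return self.end_to_end_min - (self.sync_min + self.async_min)


# Numbers for subtask 14 as reported in the quoted mail.
subtask_14 = SubtaskCheckpoint(end_to_end_min=53, sync_min=4, async_min=3)
gap = subtask_14.unaccounted_min()
print(f"unaccounted: {gap} min")  # prints "unaccounted: 46 min"
```

So for that subtask about 46 of the 53 minutes are spent outside the
snapshot itself, which is why I would look at backpressure and alignment
first.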

Hope this helps.

Best,
Congxian


Bekir Oguz <bekir.o...@persgroep.net> wrote on Thu, Jul 18, 2019 at 6:05 PM:

> Hi Congxian,
> Starting from this morning we have more issues with checkpointing in
> production. What we see is that the sync and async durations for some
> subtasks are very long, but what is strange is that the total of the sync
> and async durations is much less than the total end-to-end duration.
> Please check the following snapshot:
>
>
> For example, for subtask 14: sync duration is 4 mins, async duration is 3
> mins, end-to-end duration is 53 mins!!!
> We have a very long timeout value (1 hour) for checkpointing, but still
> many checkpoints are failing; some subtasks cannot finish checkpointing in
> 1 hour.
>
> We really appreciate your help here, this is a critical production problem
> for us at the moment.
>
> Regards,
> Bekir
>
>
> On 17 Jul 2019, at 17:46, Bekir Oguz <bekir.o...@persgroep.net> wrote:
>
>
> And I also extracted events fr
>
>
>
