[jira] [Updated] (FLINK-17477) resumeConsumption call should happen as quickly as possible to minimise latency

Flink Jira Bot (Jira) Sun, 06 Jun 2021 15:48:09 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Flink Jira Bot updated FLINK-17477:
-----------------------------------
    Labels: stale-minor  (was: )

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help 
the community manage its development. I noticed that neither this issue nor its 
subtasks had updates for 180 days, so I labeled it "stale-minor".  If you are 
still affected by this bug or are still interested in this issue, please update 
and remove the label.


> resumeConsumption call should happen as quickly as possible to minimise 
> latency
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-17477
>                 URL: https://issues.apache.org/jira/browse/FLINK-17477
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing, Runtime / Network
>    Affects Versions: 1.11.0, 1.12.0
>            Reporter: Piotr Nowojski
>            Priority: Minor
>              Labels: stale-minor
>
> We should be calling {{InputGate#resumeConsumption()}} as soon as possible 
> (to avoid any unnecessary delay/latency when task is idling). Currently I 
> think it’s mostly fine - the important bit is that on the happy path, we 
> always {{resumeConsumption}} before trying to complete the checkpoint, so 
> that netty threads will start resuming the network traffic while the task 
> thread is doing the synchronous part of the checkpoint and starting 
> asynchronous part. But I think in two places we are first aborting checkpoint 
> and only then resuming consumption (in {{CheckpointBarrierAligner}}):
> {code}
> // let the task know we are not completing this
> notifyAbort(currentCheckpointId,
>       new CheckpointException(
>               "Barrier id: " + barrierId,
>               CheckpointFailureReason.CHECKPOINT_DECLINED_SUBSUMED));
> // abort the current checkpoint
> releaseBlocksAndResetBarriers();
> {code}
> {code}
> // let the task know we skip a checkpoint
> notifyAbort(currentCheckpointId,
>               new 
> CheckpointException(CheckpointFailureReason.CHECKPOINT_DECLINED_INPUT_END_OF_STREAM));
> // no chance to complete this checkpoint
> releaseBlocksAndResetBarriers();
> {code}
> It’s not a big deal, as those are a rare conditions, but it would be better 
> to be consistent everywhere: first release blocks and resume consumption, 
> before anything else happens. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (FLINK-17477) resumeConsumption call should happen as quickly as possible to minimise latency

Reply via email to