[
https://issues.apache.org/jira/browse/FLINK-31330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Weijie Guo closed FLINK-31330.
------------------------------
Resolution: Not A Problem
After some investigation and debugging, it turns out that this problem was
already fixed by FLINK-16536. Since that change, partition requests comply
with the priority of the inputs.
As a side effect, the initialization phase can become very long: the
{{InputGate}} with the lowest priority only starts to restore its channel
state, and then transitions to the running state, after all other inputs have
finished. A possible optimization is to start the state transition as soon as
the channel state data has been loaded, rather than waiting for it to be fully
processed.
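To make the priority ordering concrete, here is a minimal sketch (hypothetical names, not Flink code) of what requesting partitions in input-priority order means: gates are visited from highest priority to lowest, so the build side requests its upstream data before the probe side.

```python
# Hypothetical sketch (not Flink code): input gates request their upstream
# partitions in priority order, so the probe gate only issues its request
# after the higher-priority build gate.
requested = []

def request_partitions(gates):
    """Request partitions gate by gate, highest priority (lowest number) first."""
    for gate in sorted(gates, key=lambda g: g["priority"]):
        requested.append(gate["name"])  # stand-in for the actual partition request

gates = [{"name": "probe", "priority": 1}, {"name": "build", "priority": 0}]
request_partitions(gates)
# requested == ["build", "probe"]
```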
> Batch shuffle may deadlock for operator with priority input
> -----------------------------------------------------------
>
> Key: FLINK-31330
> URL: https://issues.apache.org/jira/browse/FLINK-31330
> Project: Flink
> Issue Type: Technical Debt
> Components: Runtime / Network
> Affects Versions: 1.16.1
> Reporter: Weijie Guo
> Assignee: Weijie Guo
> Priority: Major
>
> For batch jobs, some operators' inputs have priorities. For example, the
> hash join operator has two inputs called {{build}} and {{probe}}; only
> after the build input is finished can the probe input start consuming.
> Unfortunately, input priority does not affect how multiple inputs request
> upstream data (i.e., request partitions). In the current implementation,
> once all states are restored, the input gates start to request partitions.
> This lets the upstream {{IO scheduler}} register readers for all downstream
> channels, so there is a possibility of deadlock.
> Assume that the upstream tasks of the hash join's build and probe inputs
> are deployed in the same TM. Then the corresponding readers are registered
> with a single {{IO scheduler}}, and they share the same
> {{BatchShuffleReadBufferPool}}. If the IO thread happens to load too many
> buffers for the probe reader, whose data the downstream will not consume
> yet, the build reader may be unable to request enough buffers. Therefore,
> a deadlock occurs.
> In fact, we anticipated this problem when designing {{SortMergeShuffle}},
> so we introduced a timeout mechanism for requesting read buffers. If buffer
> starvation happens, the downstream task triggers a failover to avoid
> permanent blocking. However, under the default configuration, a TPC-DS test
> with 10T of data can easily cause the job to fail for this reason. This
> problem deserves a better solution.
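The starvation scenario described above can be illustrated with a minimal sketch (hypothetical, not Flink code), using a counting semaphore as a stand-in for the shared read buffer pool: the probe reader's buffers are never recycled because its downstream will not consume until the build side finishes, so the build reader can only escape via a request timeout.

```python
import threading

# Hypothetical sketch (not Flink code): two readers share a fixed-size read
# buffer pool (a stand-in for BatchShuffleReadBufferPool). The probe
# reader's buffers are never returned, so the build reader starves and only
# a request timeout breaks the permanent block.

POOL_SIZE = 8
buffer_pool = threading.Semaphore(POOL_SIZE)

# The IO thread loads every available buffer for the probe reader; nothing
# is recycled because the probe data is not consumed yet.
for _ in range(POOL_SIZE):
    buffer_pool.acquire()

# The build reader requests a buffer with a timeout instead of blocking
# forever; on timeout, the downstream task would trigger a failover.
granted = buffer_pool.acquire(timeout=0.1)
if not granted:
    print("build reader timed out -> trigger failover")
```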
--
This message was sent by Atlassian Jira
(v8.20.10#820010)