[
https://issues.apache.org/jira/browse/FLINK-12329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Till Rohrmann updated FLINK-12329:
----------------------------------
Priority: Critical (was: Major)
> Netty thread deadlock bug of the SpilledSubpartitionView
> --------------------------------------------------------
>
> Key: FLINK-12329
> URL: https://issues.apache.org/jira/browse/FLINK-12329
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network
> Affects Versions: 1.8.0
> Reporter: Yingjie Cao
> Priority: Critical
> Fix For: 1.9.0
>
>
> The Netty thread may block when Flink runs in blocking batch mode.
> In my opinion, the bug results from a combination of several design choices:
> blocking buffer requests, zero copy (a buffer is recycled only after it has
> been sent out to the network), a limited number of buffers (only two, and not
> configurable), and backlog data that is not ready (a buffer must be requested
> and the data read into it; pipelined mode does not have this problem).
> In the following processing flows, the Netty thread can block itself (note
> that writeAndFlush does not mean the buffer has been sent out to the network).
> 1. request and read the first buffer -> write and flush the first buffer ->
> send the first buffer to network -> request and read the second buffer ->
> write and flush the first buffer -> no credit -> add credit -> request and
> read buffer -> blocking (the second buffer is not sent out)
> 2. no credit -> add credit -> request and read buffer -> write and flush the
> buffer -> no credit -> add credit -> request and read buffer -> blocking
> (the previous read buffer is not sent out)
> How to reproduce?
> The bug is easy to reproduce: two vertices connected by a blocking edge are
> enough. Large parallelism, a small number of slots per TM, and a large data
> volume make it more likely to occur. For example, setting the parallelism to
> 100, the number of slots per TM to 1, and more than 10M of data per
> subpartition is sufficient.
> How to fix?
> The new mmappartition implementation can fix this problem because the number
> of buffers is not limited and the data is not loaded until sent.
> The bug can also be fixed on top of the old implementation. First, the
> buffer request should not be blocking. In addition, the
> NetworkSequenceViewReader should be enqueued as an available reader when it
> becomes available for reading and is not currently registered.
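
The second fix proposed above (a non-blocking buffer request plus re-enqueuing
the reader once a buffer is recycled) can be sketched as a small single-threaded
simulation. This is only an illustration under stated assumptions, not Flink
code: the class NonBlockingReaderSketch, its fields, and its methods are all
hypothetical, the "event loop" is a plain task queue standing in for the Netty
thread, and a send is assumed to complete only when that thread has no other
work left.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical single-threaded model of the Netty thread; none of these names are Flink APIs.
class NonBlockingReaderSketch {
    private static final int POOL_SIZE = 2; // "only two buffers", as in the report

    private final Deque<Runnable> eventLoop = new ArrayDeque<>();
    private final int totalChunks;
    private int freeBuffers = POOL_SIZE;
    private int inFlight;          // written and flushed, but not yet sent to the network
    private int chunksRead;
    private int chunksSent;
    private int poolExhaustedCount;
    private boolean readerEnqueued;

    NonBlockingReaderSketch(int totalChunks) {
        this.totalChunks = totalChunks;
    }

    // Register the reader only if it has data left and is not already registered.
    private void enqueueReaderIfNeeded() {
        if (!readerEnqueued && chunksRead < totalChunks) {
            readerEnqueued = true;
            eventLoop.add(this::readNext);
        }
    }

    private void readNext() {
        readerEnqueued = false;
        if (freeBuffers == 0) {
            // Non-blocking request: give up instead of waiting. A blocking wait here
            // would stall the only thread that can complete sends and recycle buffers.
            poolExhaustedCount++;
            return;
        }
        freeBuffers--;
        chunksRead++;
        inFlight++; // models writeAndFlush: the buffer is handed over, not yet sent
        enqueueReaderIfNeeded();
    }

    // Models the send completing: the buffer is recycled and the reader is woken up.
    private void completeSend() {
        inFlight--;
        chunksSent++;
        freeBuffers++;
        enqueueReaderIfNeeded();
    }

    int run() {
        enqueueReaderIfNeeded();
        while (!eventLoop.isEmpty() || inFlight > 0) {
            if (!eventLoop.isEmpty()) {
                eventLoop.poll().run();
            } else {
                completeSend(); // the thread is idle, so a pending send can finish
            }
        }
        return chunksSent;
    }

    int timesPoolExhausted() {
        return poolExhaustedCount;
    }

    public static void main(String[] args) {
        NonBlockingReaderSketch sim = new NonBlockingReaderSketch(10);
        System.out.println("chunks sent: " + sim.run());
        System.out.println("pool exhausted: " + sim.timesPoolExhausted());
    }
}
```

In this model the reader repeatedly finds the pool empty, yet every chunk is
eventually sent, because the reader returns instead of waiting and is
registered again when a buffer is recycled. If readNext instead waited for a
free buffer, the loop would never reach completeSend, which is the
self-blocking pattern the description attributes to the Netty thread.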