[
https://issues.apache.org/jira/browse/FLINK-12538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Piotr Nowojski closed FLINK-12538.
----------------------------------
Resolution: Won't Fix
> Network notifyDataAvailable() only called after getting a new buffer
> --------------------------------------------------------------------
>
> Key: FLINK-12538
> URL: https://issues.apache.org/jira/browse/FLINK-12538
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network
> Affects Versions: 1.6.3, 1.7.2, 1.8.0, 1.9.0
> Reporter: Nico Kruber
> Priority: Major
> Labels: stale-major
>
> There is a potential regression in Flink 1.5+ which came with the low-latency
> changes. Whenever the {{RecordWriter}} finishes a buffer, it will first ask
> for a new buffer, then adds it to the appropriate result subpartition which
> notifies Netty of data being available.
> In back-pressured scenarios where all buffers from the local pool are taken,
> it may happen that you do not immediately get a new buffer and have to wait
> for as long as it takes to get it before Netty can make use of the finished
> network buffer. Pre 1.5, Flink always immediately notified the downwards
> stack.
> Although we do still have the output flusher notifying Netty within at most
> 100ms (by default), the new behaviour may actually decrease throughput and
> latency in a back-pressured scenario.
> Having a quick look at the code, changing this behaviour is probably not too
> difficult but only needs to take care not to introduce additional locking /
> locking multiple times compared to now. Things to do/consider:
> * {{PipelinedSubpartition#add()}} contains some optimisations to avoid
> unnecessary flushes but these conditions are under a lock -> try to not
> acquire it twice
> * {{RecordWriter#requestNewBufferBuilder()}} could therefore maybe have an
> optimised path with a non-blocking buffer builder request if successful and
> if not, notify/flush and do another blocking request
> After talking to [~pnowojski] offline, we are not sure how grave the issue is
> and whether we would improve by changing it. If you are willing to take a
> look and have code changing the current behaviour, please verify that it does
> not cause any performance regression itself and actually does improve some
> scenario (shown by a performance test, e.g. via
> https://github.com/dataArtisans/flink-benchmarks ).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)