[ 
https://issues.apache.org/jira/browse/FLINK-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935713#comment-16935713
 ] 

Piotr Nowojski edited comment on FLINK-14118 at 9/23/19 10:06 AM:
------------------------------------------------------------------

Yes, I would expect that the scenario you have reported could affect 
performance. These steps should be enough to reproduce it:
* 1000 output channels
* 1ms flushing interval
* 99.9% or 99.99% of records go through only a couple (1? 5? 10?) of output 
channels

We might also think about a different solution to this problem. Instead of 
adding an extra synchronisation point (which, even if it doesn't affect 
performance, complicates thread interactions), we could move the output 
flushing logic to the mailbox thread. Since all record writing happens from 
that thread, we could maintain something like a non-synchronized {{private 
boolean hasBeenWrittenTo}} flag, set to true on every record write to a 
channel/subpartition and set to false on every flush.
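
The flag idea above could be sketched roughly like this (a minimal model with made-up names, not actual Flink classes; everything is assumed to run in the single mailbox thread, which is why the flag needs no synchronization):

```java
// Hypothetical sketch of the mailbox-thread flushing flag.
public class MailboxFlushSketch {

    static class Subpartition {
        private boolean hasBeenWrittenTo = false; // touched by the mailbox thread only
        private int flushes = 0;

        void write(Object record) {
            // ... enqueue the record into the subpartition ...
            hasBeenWrittenTo = true;
        }

        // A periodic "flushAll" mailbox action would call this for each subpartition.
        void maybeFlush() {
            if (hasBeenWrittenTo) {
                flushes++; // stand-in for the real flush / netty notification
                hasBeenWrittenTo = false;
            }
        }

        int getFlushes() {
            return flushes;
        }
    }

    public static void main(String[] args) {
        Subpartition sp = new Subpartition();
        sp.maybeFlush();            // no write yet: flush is skipped
        sp.write("record");
        sp.maybeFlush();            // flushes once and clears the flag
        sp.maybeFlush();            // no new write since last flush: skipped again
        System.out.println(sp.getFlushes()); // prints 1
    }
}
```

Because only the mailbox thread reads and writes the flag, a plain boolean suffices; the open question stays the same as above, namely how to schedule the periodic "flushAll" action efficiently.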

I haven't thought this through; my main concern here would be how to 
efficiently schedule a "flushAll" mailbox action. However, I did a very quick 
PoC (https://github.com/pnowojski/flink/commits/mailbox-output-flusher) and 
benchmarked it with some of the network benchmarks, and even the simplest 
solution seems to be OK.

An alternative solution might be to revive the old idea of moving the output 
flusher to netty threads ( https://issues.apache.org/jira/browse/FLINK-8625 ), 
which has great performance improvement potential; however, it has its own 
unresolved issues ( 
https://github.com/apache/flink/pull/6698#discussion_r223309406 ) and might not 
solve this particular case anyway.



> Reduce the unnecessary flushing when there is no data available for flush
> -------------------------------------------------------------------------
>
>                 Key: FLINK-14118
>                 URL: https://issues.apache.org/jira/browse/FLINK-14118
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>            Reporter: Yingjie Cao
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.10.0, 1.9.1, 1.8.3
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The new flush implementation, which works by triggering a netty user event, 
> may cause performance regression compared to the old synchronization-based 
> one. More specifically, when there is exactly one BufferConsumer in the 
> buffer queue of a subpartition and no new data will be added for a while 
> (perhaps because there is simply no input, or because the operator collects 
> some data for processing and does not emit records immediately), that is, 
> there is no data to send, the OutputFlusher will continuously notify data 
> availability and wake up the netty thread, even though no data will be 
> returned by the pollBuffer method.
> For some of our production jobs, this incurs 20% to 40% CPU overhead 
> compared to the old implementation. We tried to fix the problem by checking 
> whether there is new data available when flushing; if there is none, the 
> netty thread is not notified. This works for our jobs, and the CPU usage 
> falls back to the previous level.
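
The guard described in the quoted report could be modelled roughly like this (hypothetical names; this is not Flink's actual PipelinedSubpartition API, just a counter-based model of the "skip the notification when nothing was added since the last flush" idea):

```java
// Hypothetical model: the flusher only wakes the netty thread when new
// buffers actually arrived since the last flush.
public class FlushGateSketch {

    static class FlushGate {
        private int buffersSinceLastFlush = 0;
        private int nettyWakeups = 0;

        void addBuffer(Object buffer) {
            buffersSinceLastFlush++;
        }

        // Called by the periodic OutputFlusher.
        void flush() {
            if (buffersSinceLastFlush == 0) {
                return; // no new data: do not wake the netty thread
            }
            buffersSinceLastFlush = 0;
            nettyWakeups++; // stand-in for the data-availability notification
        }

        int getNettyWakeups() {
            return nettyWakeups;
        }
    }

    public static void main(String[] args) {
        FlushGate gate = new FlushGate();
        for (int i = 0; i < 1000; i++) {
            gate.flush(); // a 1ms flusher firing with no new data
        }
        gate.addBuffer("buffer");
        gate.flush();
        System.out.println(gate.getNettyWakeups()); // prints 1
    }
}
```

Under this model, a thousand empty flush cycles cause zero netty wakeups, which mirrors the reported CPU savings when idle subpartitions stop being notified.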



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
