[ 
https://issues.apache.org/jira/browse/FLINK-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944286#comment-16944286
 ] 

Piotr Nowojski commented on FLINK-14118:
----------------------------------------

There were some smaller changes, probably insignificant ones. Still, I 
wouldn't like to risk introducing a critical bug/regression:
1. Given how fragile the network stack can be with respect to subtle bugs, and 
how poorly tested our bug-fix releases are, I wouldn't back-port it. 
2. If we merge it to the release-1.9 branch now, I'm pretty sure this 
improvement would be released as part of the 1.9.x line way sooner than 1.10.
3. For me this is not necessarily a bug, but a new feature/improvement. Nico 
and I were aware of this potential regression, but thought that the fix would 
bring even more harm - apparently incorrectly.
4. Nobody has reported it for 2 years. Probably only a small fraction of 
users (high parallelism, high throughput [no RocksDB, light records, etc.], 
high ratio of idling vs. busy Tasks) can experience it, and/or the regression 
was not visible to most users among the general low-latency improvements.



> Reduce the unnecessary flushing when there is no data available for flush
> -------------------------------------------------------------------------
>
>                 Key: FLINK-14118
>                 URL: https://issues.apache.org/jira/browse/FLINK-14118
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>            Reporter: Yingjie Cao
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.10.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The new flush implementation, which works by triggering a netty user event, 
> may cause a performance regression compared to the old synchronization-based 
> one. More specifically, when there is exactly one BufferConsumer in the 
> buffer queue of a subpartition and no new data will be added for a while 
> (perhaps because there is simply no input, or because the operator collects 
> some data for processing and does not emit records immediately), that is, 
> when there is no data to send, the OutputFlusher will continuously notify 
> data availability and wake up the netty thread, though no data will be 
> returned by the pollBuffer method.
> For some of our production jobs, this incurs 20% to 40% CPU overhead 
> compared to the old implementation. We tried to fix the problem by checking 
> whether new data is available when flushing; if there is none, the netty 
> thread is not notified. It works for our jobs, and the CPU usage falls back 
> to the previous level.
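
The check described above can be illustrated with a minimal sketch. This is 
not Flink's actual code: the class FlushSketch, its fields, and the wakeUps 
counter (a stand-in for waking the netty thread) are hypothetical names chosen 
for illustration. Only the idea of skipping the notification when nothing new 
is available comes from the issue description.

```java
/**
 * Hypothetical, simplified model of a subpartition's flush path.
 * It is NOT Flink's implementation; all names here are assumptions.
 */
class FlushSketch {
    private int unreadBytes;        // data written but not yet polled
    private boolean flushRequested; // a notification is already pending
    private int wakeUps;            // stand-in for waking the netty thread

    /** Producer side: appends data to the (modelled) buffer queue. */
    synchronized void write(int numBytes) {
        unreadBytes += numBytes;
    }

    /** Called periodically by the (modelled) output flusher. */
    synchronized void flush() {
        // The check from the description: skip the notification when there
        // is nothing new to send or when one is already pending.
        if (unreadBytes == 0 || flushRequested) {
            return;
        }
        flushRequested = true;
        wakeUps++;
    }

    /** Consumer side: drains available data and clears the pending flag. */
    synchronized int pollBuffer() {
        int polled = unreadBytes;
        unreadBytes = 0;
        flushRequested = false;
        return polled;
    }

    synchronized int wakeUps() {
        return wakeUps;
    }

    public static void main(String[] args) {
        FlushSketch subpartition = new FlushSketch();
        subpartition.flush();   // idle: no data, so no wake-up
        subpartition.write(128);
        subpartition.flush();   // new data: one wake-up
        subpartition.flush();   // still pending: no second wake-up
        System.out.println("wake-ups: " + subpartition.wakeUps());
    }
}
```

In this model, repeated flush() calls on an idle subpartition leave the 
counter unchanged, which corresponds to the netty thread no longer being woken 
up when pollBuffer would return nothing.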



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
