[ 
https://issues.apache.org/jira/browse/FLINK-36510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17911219#comment-17911219
 ] 

Ferenc Csaky commented on FLINK-36510:
--------------------------------------

There is no released Flink version that contains this change, and with this 
change we were finally able to ditch Netty3, which has accumulated a lot of 
nasty CVEs because it has been EOS for a while.

Chesnay has a valid point that it is risky to include a major transitive 
dependency update in a patch version, but getting rid of Netty3 has been on 
the table for years. Anyway, I'm not saying we should definitely not revert 
this, but based on my testing yesterday and today I am not sure Netty is 
causing any kind of leak. Regarding that, I left a comment on FLINK-36290.

So I spent some time running the Flink E2E Netty shuffle test that [~mapohl] 
linked, which failed 
[here|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=64748&view=logs&j=e8e46ef5-75cc-564f-c2bd-1797c35cbebe&t=60c49903-2505-5c25-7e46-de91b1737bea&l=12596]
 with:
{code:java}
Jan 02 06:20:02 org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: Cannot reserve 4194304 bytes of direct buffer memory (allocated: 140396831, limit: 141557760) (connection to 'localhost/127.0.0.1:42031 [localhost:45071-b0167d]')
Jan 02 06:20:02         at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:175) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Jan 02 06:20:02         at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:346) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
{code}
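As a rough illustration of the numbers in that error message, the sketch below computes how much direct memory was still free when the 4 MiB request failed. The class and method names are hypothetical, and the check is a simplified reading of the logged values, not the JDK's or Netty's actual reservation logic.

```java
public class DirectMemoryHeadroom {

    /** Bytes still available before the direct-memory limit is reached. */
    static long headroom(long limit, long allocated) {
        return limit - allocated;
    }

    /** True if a request of the given size fits into the remaining headroom. */
    static boolean fits(long limit, long allocated, long requested) {
        return requested <= headroom(limit, allocated);
    }

    public static void main(String[] args) {
        long limit = 141_557_760L;     // "limit" from the log line (135 MiB)
        long allocated = 140_396_831L; // "allocated" from the log line
        long requested = 4_194_304L;   // requested reservation (4 MiB)

        System.out.println("headroom bytes: " + headroom(limit, allocated));
        System.out.println("request fits:   " + fits(limit, allocated, requested));
    }
}
```

With the logged values, only 1160929 bytes (about 1.1 MiB) were left, so a 4 MiB reservation could not succeed.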
That test limits Netty's off-heap memory drastically (to 
[7MB|https://github.com/apache/flink/blob/bcc44d2d3b8c6de1de074cf0b3ca21b2c38ff1ac/flink-end-to-end-tests/test-scripts/test_netty_shuffle_memory_control.sh#L37]),
 because the purpose of that 
[test|https://github.com/apache/flink/blob/master/flink-end-to-end-tests/flink-netty-shuffle-memory-control-test/src/main/java/org/apache/flink/streaming/tests/NettyShuffleMemoryControlTestProgram.java]
 is to check Netty's memory footprint when Flink TaskManagers shuffle data and 
that data travels through Netty channels.

On JDK11+ (I tested JDK11 and JDK17), with the memory settings defined in the 
[E2E test 
script|https://github.com/apache/flink/blob/bcc44d2d3b8c6de1de074cf0b3ca21b2c38ff1ac/flink-end-to-end-tests/test-scripts/test_netty_shuffle_memory_control.sh],
 the insufficient memory error for Netty happens right after the TM starts, and 
Pekko cannot start up properly. If we give it more memory, it is fine. 
Alternatively, as [~hepin] pointed out, setting 
{{-Dorg.apache.flink.shaded.netty4.io.netty.tryReflectionSetAccessible=true}} 
as a TM JVM option also solves the problem with the existing memory 
configuration, because it reduces the required Netty memory footprint on JDK9+.

Personally, I was not able to reproduce any kind of leak with this shuffle 
test. I ran it for 20 minutes both with and without 
{{io.netty.tryReflectionSetAccessible=true}}; as long as Netty had enough 
memory from the start, there was no leak. During these runs I enabled TRACE 
logging for Netty and added 
{{-Dorg.apache.flink.shaded.netty4.io.netty.allocator.type=unpooled}} and 
{{-Dorg.apache.flink.shaded.netty4.io.netty.leakDetection.level=PARANOID}} to 
the TM configs, and nothing was reported.
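One generic way to watch for steadily growing direct memory in a JVM is to sample the JDK's "direct" buffer pool over time. The sketch below is not the procedure used in the tests above; the class name is hypothetical, and it only sees buffers allocated through {{java.nio}}, so memory that Netty accounts for on its own may not show up here.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DirectPoolProbe {

    /** Bytes currently used by the JDK "direct" buffer pool, or -1 if the pool is not found. */
    static long directMemoryUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) throws InterruptedException {
        // Sample a few times; a value that keeps growing under steady load
        // would be a hint worth investigating, not proof of a leak.
        for (int i = 0; i < 5; i++) {
            System.out.println("direct bytes used: " + directMemoryUsed());
            Thread.sleep(1000L);
        }
    }
}
```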

To sum up, based on my findings I am not convinced that Netty causes a leak. I 
will reference this comment on the dev mailing list, as there was somebody who 
was able to reproduce a leak with a Kafka -> Iceberg job. Tomorrow I will also 
try that kind of job myself with both JDK11 and JDK17.

> Upgrade Pekko from 1.0.1 to 1.1.2
> ---------------------------------
>
>                 Key: FLINK-36510
>                 URL: https://issues.apache.org/jira/browse/FLINK-36510
>             Project: Flink
>          Issue Type: Technical Debt
>          Components: Runtime / Coordination
>    Affects Versions: 1.20.0, 1.19.1, 2.0-preview
>            Reporter: Grace Grimwood
>            Assignee: Grace Grimwood
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0, 1.19.2, 1.20.1
>
>
> Updates Pekko dependency to 1.1.2 which in turn upgrades Netty 3 to 4 
> (addressing FLINK-29065 and removing several CVEs from Flink). Pekko 1.1 also 
> upgrades other dependencies such as slf4j and Jackson. For more details see 
> the [Pekko 1.1 release 
> notes|https://pekko.apache.org/docs/pekko/current/release-notes/releases-1.1.html].



