[
https://issues.apache.org/jira/browse/FLINK-36510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17911219#comment-17911219
]
Ferenc Csaky commented on FLINK-36510:
--------------------------------------
There is no released Flink version that contains this change, and with this
change we were finally able to ditch Netty3, which has accumulated a lot of
nasty CVEs since it has been EOS for a while.
Chesnay has a valid point that it is risky to include a major transitive
dependency update in a patch version, but getting rid of Netty3 has been on
the table for years. Anyway, I'm not saying we should definitely not revert
this, but based on my testing yesterday and today I am not sure Netty is
causing any kind of leak. I also left a comment about that on FLINK-36290.
So I spent some time running the Flink E2E Netty shuffle test that [~mapohl]
linked, which failed
[here|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=64748&view=logs&j=e8e46ef5-75cc-564f-c2bd-1797c35cbebe&t=60c49903-2505-5c25-7e46-de91b1737bea&l=12596]
with:
{code:java}
Jan 02 06:20:02
org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
Cannot reserve 4194304 bytes of direct buffer memory (allocated: 140396831,
limit: 141557760) (connection to 'localhost/127.0.0.1:42031
[localhost:45071-b0167d]')
Jan 02 06:20:02 at
org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:175)
~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Jan 02 06:20:02 at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:346)
~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
{code}
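As a quick sanity check of the numbers in that message, the arithmetic can be sketched as follows. The class and method names are made up for illustration; only the three byte counts come from the log line.

```java
public class DirectMemoryHeadroom {
    // Bytes still available under the direct memory limit.
    static long headroom(long limit, long allocated) {
        return limit - allocated;
    }

    public static void main(String[] args) {
        long requested = 4_194_304L;   // 4 MiB, the reservation that failed
        long allocated = 140_396_831L; // "allocated" value from the log
        long limit = 141_557_760L;     // "limit" value from the log (135 MiB)
        long free = headroom(limit, allocated);
        // prints: free=1160929 bytes, requested=4194304, fits=false
        System.out.println("free=" + free + " bytes, requested=" + requested
                + ", fits=" + (free >= requested));
    }
}
```

So only about 1.1 MiB was left under the limit when the 4 MiB reservation was attempted.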
That test limits the off-heap memory of Netty drastically (to
[7MB|https://github.com/apache/flink/blob/bcc44d2d3b8c6de1de074cf0b3ca21b2c38ff1ac/flink-end-to-end-tests/test-scripts/test_netty_shuffle_memory_control.sh#L37]),
because the purpose of that
[test|https://github.com/apache/flink/blob/master/flink-end-to-end-tests/flink-netty-shuffle-memory-control-test/src/main/java/org/apache/flink/streaming/tests/NettyShuffleMemoryControlTestProgram.java]
is to check Netty's memory footprint when Flink TaskManagers shuffle data,
which travels through Netty channels.
So on JDK11+ (I tested JDK11 and JDK17) with the memory settings defined in the
[E2E test
script|https://github.com/apache/flink/blob/bcc44d2d3b8c6de1de074cf0b3ca21b2c38ff1ac/flink-end-to-end-tests/test-scripts/test_netty_shuffle_memory_control.sh]
the insufficient memory error for Netty happens right after the TM starts, and
Pekko cannot start up properly. If we give it more memory, it works fine. Or,
as [~hepin] pointed out, setting
{{-Dorg.apache.flink.shaded.netty4.io.netty.tryReflectionSetAccessible=true}}
as a TM JVM option also solves the problem with the already existing memory
configuration, because it reduces the required Netty memory footprint on JDK9+.
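To confirm from inside a JVM whether that flag is in effect, something along these lines could be used. This is only a sketch: the class and method names are hypothetical, and the fallback (enabled below Java 9, disabled otherwise) is an assumption about Netty 4.1's default, not something taken from the Flink or Netty sources here.

```java
public class ReflectionFlagCheck {
    // Property name as used with Flink's shaded Netty (taken from the comment above).
    static final String KEY =
            "org.apache.flink.shaded.netty4.io.netty.tryReflectionSetAccessible";

    // ASSUMPTION: when the property is unset, Netty 4.1 enables this only below Java 9.
    // Netty's own parsing also accepts other spellings; this sketch only handles true/false.
    static boolean effectiveValue(String configured, int javaFeatureVersion) {
        if (configured == null || configured.isEmpty()) {
            return javaFeatureVersion < 9;
        }
        return Boolean.parseBoolean(configured);
    }

    public static void main(String[] args) {
        int feature = Runtime.version().feature(); // e.g. 11 or 17
        String configured = System.getProperty(KEY);
        System.out.println("java=" + feature + ", configured=" + configured
                + ", effective=" + effectiveValue(configured, feature));
    }
}
```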
Personally, I was not able to reproduce any kind of leak running this shuffle
test. I ran it for 20 minutes both with and without
{{io.netty.tryReflectionSetAccessible=true}}, and as long as Netty had enough
memory from the start, there was no leak. During these tests I enabled TRACE
logging for Netty and added
{{-Dorg.apache.flink.shaded.netty4.io.netty.allocator.type=unpooled}} and
{{-Dorg.apache.flink.shaded.netty4.io.netty.leakDetection.level=PARANOID}} to
the TM configs, and still got nothing.
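Independently of Netty's own leak detection, the JDK's view of direct memory can be sampled with the standard {{BufferPoolMXBean}}. The following is a minimal sketch with a hypothetical class name, not something the E2E test does. One caveat, stated as an assumption: memory that Netty allocates through its reflection-based path (the one enabled by the flag above) may not be counted in this pool.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

public class DirectPoolProbe {
    // Bytes currently used by the JVM's "direct" buffer pool, or -1 if the pool is not found.
    static long directMemoryUsed() {
        for (BufferPoolMXBean pool
                : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long before = directMemoryUsed();
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20); // hold 1 MiB of direct memory
        long after = directMemoryUsed();
        System.out.println("direct pool grew by " + (after - before)
                + " bytes while holding a buffer of " + buf.capacity() + " bytes");
    }
}
```

Sampling {{directMemoryUsed()}} periodically over a long run would show steady growth if buffers allocated through the JDK were not being released.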
To sum it up, based on my findings I am not convinced Netty causes a leak. I
will reference this comment on the dev mailing list, as somebody there was
able to reproduce a leak with a Kafka -> Iceberg job. Tomorrow I will also
try that kind of job myself with both JDK11 and JDK17.
> Upgrade Pekko from 1.0.1 to 1.1.2
> ---------------------------------
>
> Key: FLINK-36510
> URL: https://issues.apache.org/jira/browse/FLINK-36510
> Project: Flink
> Issue Type: Technical Debt
> Components: Runtime / Coordination
> Affects Versions: 1.20.0, 1.19.1, 2.0-preview
> Reporter: Grace Grimwood
> Assignee: Grace Grimwood
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.0.0, 1.19.2, 1.20.1
>
>
> Updates Pekko dependency to 1.1.2 which in turn upgrades Netty 3 to 4
> (addressing FLINK-29065 and removing several CVEs from Flink). Pekko 1.1 also
> upgrades other dependencies such as slf4j and Jackson. For more details see
> the [Pekko 1.1 release
> notes|https://pekko.apache.org/docs/pekko/current/release-notes/releases-1.1.html].