[GitHub] [flink] dannycranmer commented on pull request #18733: [FLINK-26064][connectors/firehose,kinesis] Using own event loop group…

GitBox Tue, 15 Feb 2022 09:53:34 -0800


dannycranmer commented on pull request #18733:
URL: https://github.com/apache/flink/pull/18733#issuecomment-1040593983

I have been taking a deeper dive into this issue and am no longer inclined
to make this change. I have verified that when running on a cluster, each job
creates it's own `aws-java-sdk-NettyEventLoop` threads, and jobs do not
interfere with each other. Running test cluster with 1 TM 1 CPU and 2 slots,
posting 2 identical jobs. First job produces 2 threads:
```
"aws-java-sdk-NettyEventLoop-0-0" daemon prio=5 Id=76 RUNNABLE
"aws-java-sdk-NettyEventLoop-0-1" daemon prio=5 Id=77 RUNNABLE
```

Second job produces 2 threads:
```
"aws-java-sdk-NettyEventLoop-0-0" daemon prio=5 Id=95 RUNNABLE
"aws-java-sdk-NettyEventLoop-0-1" daemon prio=5 Id=96 RUNNABLE
```

Third job produces more threads etc. Killing jobs does not impact the other
running jobs. You will notice the thread names are the same. This is because
they are isolated in different job classloaders and cannot see each other to
[increment the pool
number](https://github.com/aws/aws-sdk-java-v2/blob/master/utils/src/main/java/software/amazon/awssdk/utils/ThreadFactoryBuilder.java#L75).

I have also managed to consistently reproduce this issue locally, and backup
@dmvk findings. To reproduce this I made the following changes ([My test
code](https://github.com/dannycranmer/flink/commit/6e42d2a052ca1ef2d13679159de9f9ad617121ac)):
- Switch the Flink job submission to `executeAsync()`
- Move the test client initialisation to after job submission

The problem in the flaky tests is that the `FlinkUserClassLoader` has the
`Launcher$AppClassLoader` as it's parent classloader. This is the same
classloader used by the test code. This means that both test code and Flink job
can "see" the same ELG. As @dmvk pointed out, the Netty thread inherits current
thread context classloader. The issue seems be be a race condition, since these
clients are async, I guess that whoever makes the request first creates the
Netty thread (not yet confirmed). The test blocks for Flink job to finish,
which destroys the `FlinkUserClassLoader` and results in the classloader leak
on the Netty threads.

![Evaluate_and_flink-parent_–_SharedSdkEventLoopGroup_class__Maven__software_amazon_awssdk_netty-nio-client_2_17_52_](https://user-images.githubusercontent.com/4950503/154118605-e12ede0e-43b2-4acb-86d6-9adb4a4230aa.png)

Based on this, I am not inclined to create a new ELG per sub task, since
this will potentially create a large resource overhead. This would also impact
our current Kinesis EFO source, since it already uses the AWS SDK v2 async
clients. Instead I would prefer to fix it in the test, either by setting a
separate ELG on the test clients, or migrating to synchronous clients. I prefer
the latter, since there is no need to spin up a thread pool for these tests!

The tradeoff would be that this issue could still surface for people running
embedded Flink mini clusters. However, given the compromise I am willing to
accept this.

@dmvk please give me your thoughts. Also, happy to discuss if easier.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [flink] dannycranmer commented on pull request #18733: [FLINK-26064][connectors/firehose,kinesis] Using own event loop group…

Reply via email to