dannycranmer commented on pull request #18733:
URL: https://github.com/apache/flink/pull/18733#issuecomment-1040593983


   I have been taking a deeper dive into this issue and am no longer inclined 
to make this change. I have verified that when running on a cluster, each job 
creates it's own `aws-java-sdk-NettyEventLoop` threads, and jobs do not 
interfere with each other. Running test cluster with 1 TM 1 CPU and 2 slots, 
posting 2 identical jobs. First job produces 2 threads:
   ```
   "aws-java-sdk-NettyEventLoop-0-0" daemon prio=5 Id=76 RUNNABLE
   "aws-java-sdk-NettyEventLoop-0-1" daemon prio=5 Id=77 RUNNABLE
   ```
   
   Second job produces 2 threads:
   ```
   "aws-java-sdk-NettyEventLoop-0-0" daemon prio=5 Id=95 RUNNABLE 
   "aws-java-sdk-NettyEventLoop-0-1" daemon prio=5 Id=96 RUNNABLE
   ```
   
   Third job produces more threads etc. Killing jobs does not impact the other 
running jobs. You will notice the thread names are the same. This is because 
they are isolated in different job classloaders and cannot see each other to 
[increment the pool 
number](https://github.com/aws/aws-sdk-java-v2/blob/master/utils/src/main/java/software/amazon/awssdk/utils/ThreadFactoryBuilder.java#L75).
   
   I have also managed to consistently reproduce this issue locally, and backup 
@dmvk findings. To reproduce this I made the following changes ([My test 
code](https://github.com/dannycranmer/flink/commit/6e42d2a052ca1ef2d13679159de9f9ad617121ac)):
   - Switch the Flink job submission to `executeAsync()`
   - Move the test client initialisation to after job submission
   
   The problem in the flaky tests is that the `FlinkUserClassLoader` has the 
`Launcher$AppClassLoader` as it's parent classloader. This is the same 
classloader used by the test code. This means that both test code and Flink job 
can "see" the same ELG. As @dmvk pointed out, the Netty thread inherits current 
thread context classloader. The issue seems be be a race condition, since these 
clients are async, I guess that whoever makes the request first creates the 
Netty thread (not yet confirmed). The test blocks for Flink job to finish, 
which destroys the `FlinkUserClassLoader` and results in the classloader leak 
on the Netty threads.
   
   
![Evaluate_and_flink-parent_–_SharedSdkEventLoopGroup_class__Maven__software_amazon_awssdk_netty-nio-client_2_17_52_](https://user-images.githubusercontent.com/4950503/154118605-e12ede0e-43b2-4acb-86d6-9adb4a4230aa.png)
   
   Based on this, I am not inclined to create a new ELG per sub task, since 
this will potentially create a large resource overhead. This would also impact 
our current Kinesis EFO source, since it already uses the AWS SDK v2 async 
clients. Instead I would prefer to fix it in the test, either by setting a 
separate ELG on the test clients, or migrating to synchronous clients. I prefer 
the latter, since there is no need to spin up a thread pool for these tests!
   
   The tradeoff would be that this issue could still surface for people running 
embedded Flink mini clusters. However, given the compromise I am willing to 
accept this.
   
   @dmvk please give me your thoughts. Also, happy to discuss if easier.
   
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to