netvl opened a new pull request, #22854:
URL: https://github.com/apache/flink/pull/22854

   ## What is the purpose of the change
   
   Because lambda "class names" can be different across different JVMs, the 
flame graph in Flink UI is often unreadable, especially with high parallelism 
with large number of task managers (there are 5 task managers here and 
therefore 5 "columns", but if the application has 50+ task managers, this 
becomes completely unreadable):
   <img width="1337" alt="bad-1" 
src="https://github.com/apache/flink/assets/280456/a45ec88d-e5d6-4905-93e3-7d20f872481c";>
   
   This can happen in both Flink's internal logic which uses lambdas:
   <img width="1337" alt="bad-3" 
src="https://github.com/apache/flink/assets/280456/b1f6ca32-989f-4715-bf5e-20c2ae782524";>
   
   and in user code:
   <img width="1337" alt="bad-2" 
src="https://github.com/apache/flink/assets/280456/765e528a-47d1-4796-8b06-96ea7be59f3d";>
   
   The subtask view does help partially, but it is bad for understanding of the 
big picture, when you want to check what happens within the task as a whole.
   
   This PR adds filtering logic to the code which collects stack traces which 
are subsequently used to build flame graphs. This logic removes stack trace 
elements which correspond to lambda invocations, so in the final stack trace 
used for flame graphs lambda calls are not visible:
   
   This does not really drop any useful information (except for the fact that 
lambdas are actually used, which is not that useful) — these elements don't 
even carry source file/line information. Also, as a side note, the 
`Thread.getStackTrace()` method appears to have similar logic inside, because 
you don't see lambda invocations in the stack traces returned by 
`Thread.getStackTrace()` for any given thread.
   
   ## Brief change log
   
   - Added filtering of stack traces to remove lambda-related entries
   - Added tests for this functionality
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
     - Manually verified the change by running a 6 node cluster with 1 
JobManager and 5 TaskManagers, and captured flame graphs of a simple job with a 
parallel cpu-heavy map operation before and after the change. The pictures are 
available above.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to