[ 
https://issues.apache.org/jira/browse/FLINK-18738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175612#comment-17175612
 ] 

Stephan Ewen commented on FLINK-18738:
--------------------------------------

Thanks a lot for the discussion and the detailed plan.
I have a few questions, just to double check some assumptions.

*Process Model*

Generally, going with one Python process per slot seems fair, but that means 
the Python process executes multiple operators.
How well does that work? Is there one gRPC connection that handles all 
operators, or are there multiple streams between the processes (one per 
operator)? Are there any issues, for example with the GIL, that impact 
performance?

*Memory Management*

I get the reason to decouple the managed memory and the Python memory. However, 
it makes the memory model yet more complicated. At least, I would go with the 
following:
  - There should be no impact for existing users, so any Python memory should 
be zero by default.
  - The PyFlink package (from {{pip install}}) could change that default to a 
more typical value.
  - A good exception should help users understand which parameter to configure 
when no Python memory is configured.
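
To make the last two points concrete, here is a minimal sketch of the check I 
have in mind. It is written in Python only for brevity. The option name 
"python.memory.size.mb" and the helper resolve_python_memory_mb are 
hypothetical and used purely for illustration; they are not existing Flink 
options or APIs.

```python
# Hypothetical option name, used only for this illustration.
PYTHON_MEMORY_KEY = "python.memory.size.mb"


class MissingPythonMemoryError(ValueError):
    """Raised when Python operators are present but no Python memory is set."""


def resolve_python_memory_mb(config, has_python_operators):
    """Return the configured Python memory in MB, defaulting to zero."""
    # Zero by default, so jobs without Python operators are not affected.
    size_mb = int(config.get(PYTHON_MEMORY_KEY, 0))
    if size_mb < 0:
        raise ValueError(
            f"'{PYTHON_MEMORY_KEY}' must not be negative, got {size_mb}."
        )
    if has_python_operators and size_mb == 0:
        # Name the exact parameter so the user knows what to change.
        raise MissingPythonMemoryError(
            f"The job contains Python operators, but '{PYTHON_MEMORY_KEY}' "
            f"is 0. Set '{PYTHON_MEMORY_KEY}' to a positive value to reserve "
            "memory for the Python process."
        )
    return size_mb
```

A distribution such as the PyFlink package could ship a config that sets the 
hypothetical key to a non-zero value, while a plain Flink install keeps the 
zero default and never reaches the exception.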

The managed memory integration on the other hand could be quite simple. The 
Python process could be a simple external resource claiming budget from the 
Memory Manager. The per-slot bookkeeping, the singleton initialization, ref 
counting, thread-safe shutdown, all of that is already there from the RocksDB 
integration. It should be straightforward to just reuse this.
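
As a rough illustration of that pattern (not the actual Memory Manager code), 
the sketch below shows per-slot bookkeeping with lazy singleton 
initialization, reference counting, and a lock-protected shutdown. The class 
SlotMemoryBudget and its methods are hypothetical names made up for this 
example, and Python is used only to keep it short.

```python
import threading


class SlotMemoryBudget:
    """Hypothetical per-slot budget that hands out shared, ref-counted resources."""

    def __init__(self, total_bytes):
        self._lock = threading.Lock()
        self._available = total_bytes
        self._shared = {}  # name -> [resource, size_bytes, ref_count]

    def acquire_shared(self, name, size_bytes, initializer):
        """Return the shared resource for `name`, creating it on first use."""
        with self._lock:
            entry = self._shared.get(name)
            if entry is None:
                if size_bytes > self._available:
                    raise MemoryError(
                        f"Cannot reserve {size_bytes} bytes for '{name}': "
                        f"only {self._available} bytes left in this slot."
                    )
                # Initialize first, so a failing initializer leaks no budget.
                resource = initializer(size_bytes)
                self._available -= size_bytes
                entry = [resource, size_bytes, 0]
                self._shared[name] = entry
            entry[2] += 1
            return entry[0]

    def release_shared(self, name, disposer):
        """Drop one reference; dispose and return the budget on the last one."""
        with self._lock:
            entry = self._shared[name]
            entry[2] -= 1
            if entry[2] == 0:
                del self._shared[name]
                disposer(entry[0])
                self._available += entry[1]

    def available_bytes(self):
        with self._lock:
            return self._available
```

In this sketch, every Python operator in a slot would call acquire_shared 
with the same name, so only the first call starts the process and reserves 
the budget, and only the last release_shared shuts it down.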

The advantages of this are:
  - a simple design, with no extended memory model
  - no wasted memory in session clusters that are used in a mixed way (this is 
probably not too important to optimize for, though). 

> Revisit resource management model for python processes.
> -------------------------------------------------------
>
>                 Key: FLINK-18738
>                 URL: https://issues.apache.org/jira/browse/FLINK-18738
>             Project: Flink
>          Issue Type: Task
>          Components: API / Python, Runtime / Coordination
>            Reporter: Xintong Song
>            Assignee: Xintong Song
>            Priority: Major
>             Fix For: 1.12.0
>
>
> This ticket is for tracking the effort towards a proper long-term resource 
> management model for python processes.
> In FLINK-17923, we run into problems due to python processes are not well 
> integrate with the task manager resource management mechanism. A temporal 
> workaround has been merged for release-1.11.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
