StephanEwen commented on issue #8846: [FLINK-12766][runtime] Dynamically 
allocate TaskExecutor's managed memory to slots.
URL: https://github.com/apache/flink/pull/8846#issuecomment-510169482
 
 
   The code is overall in decent shape.
   I am a bit unsure, though, about this approach.
   
   For example all the changes on the TaskManager result in something quite 
unintuitive to me.
     - there are slots, they have a resource profile
     - then an allocated resource profile, which is possibly a subset of that.
     - some operations (for example to identify idempotent requests) now take 
the new resource profile into account. I cannot even say whether that is 
correct or not - probably not breaking anything immediately, but breaks with 
the design of how requests are identified and can cause tricky bugs later.
     - the TM may now request requests for free slots based on its bookkeeping. 
While this should typically not happen when the RM bookkeeping is in sync, what 
happens when the views diverge, due to a request timeout? The slot allocation 
protocol is so sensitive, do we really want to add this additional complexity?
   
   I assume that the reason for all the TM changes is a combination of 
     - when the JM allocates a slot, it actually gets that slot offered from 
the TM directly, and not from the RM. So the profile somehow needs to go 
through the TM to end up in the Slot on the JM
     - some safety net bookkeeping
   
   This solution is a temporary solution, an in the long run we probably want 
to either have fully dynamically sized slots, allocated and sized on demand, or 
keep the slots are static and change the operators in the APIs to adjust their 
resource consumption to what is there in the slots.
   
   Having a temporary solution, of this big complexity, touching something so 
sensitive as the slot allocation, and for a new SQL runtime that is in its 
first release and might change quite a bit in the next release... I simply do 
not feel comfortable to merge this. The risk of bugs and regression, and the 
added complexity seems so high compared to the benefit.
   
   I would be happy to double down on a FLIP about how resource management 
should look like in the next versions and make sure we have something really 
well working in the following release.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to