Re: [PR] [FLINK-33764] Track Heap usage and GC pressure to avoid unnecessary scaling [flink-kubernetes-operator]

via GitHub Wed, 13 Dec 2023 01:40:43 -0800


gyfora commented on PR #726:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/726#issuecomment-1853576202


   > I worry a bit about the implications of enabling this feature by default 
(also see [#726 
(comment)](https://github.com/apache/flink-kubernetes-operator/pull/726#discussion_r1422277362)).
 Can we put this behind a flag?
   > 
   > I feel like the solution to this problem is more complex, but the current 
code is definitely part of the solution. I'm not sure stopping to scale on GC 
pressure / heap usage is always desirable. In a lot of scenarios, scaling up 
might resolve GC pressure / heap usage issues. But after merging this PR, we 
might get stuck in a bad state. Users do not typically monitor their pipelines 
events that closely.
   > 
   > If we discover heap / GC pressure, it looks like we want to let the user 
know, scale up to solve the issue, and block scaling to a lower parallelism. 
Not allowing scaling might actually make the situation worse.
   
   @1996fanrui @mxm 
   
   I don't really want to add more on/off switches but we can consider setting 
the default limit larger or to 1 (effectively disabling this feature) initially 
if we think this needs more iteration.
   
   However I am personally very much in favour of enabling it at least for the 
GC time for the following reason:
   
   When you spend a considerable time in GC (30% is actually a lot) the 
processing is probably extremely slow compared to a state where you have normal 
gc time. In my experience when you are above 30% GC you are usually way above 
it and the degradation is massive, in most of these cases processing comes to 
almost a complete halt relatively speaking.
   
   This means that true processing time and other time based measurements are 
completely off (very low compared to the actual value without memory pressure). 
So the scale up in this case would be very much overshooting, risking a large 
resource / cost spike that is basically 100% wrong. I think this is a big 
production risk that can affect trust in the autoscaler.
   
   So I see 3 options here:
    1. Keep the proposed 30% threshold
    2. Increase the threshold to let's say 50% to be on the slightly safer side
    3. Disable the check by default
    
   I am personally +1 on 1 or 2 with a slight preference towards 1.
   
   I agree with @1996fanrui s concern for the more simplistic heap check should 
be turned off by default. (threshold to 1)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [FLINK-33764] Track Heap usage and GC pressure to avoid unnecessary scaling [flink-kubernetes-operator]

Reply via email to