hanahmily opened a new issue, #13485:
URL: https://github.com/apache/skywalking/issues/13485

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/skywalking/issues?q=is%3Aissue) and found no 
similar feature requirement.
   
   
   ### Description
   
   
   
   #### **Motivation**
   
   A recent performance test led by @mrproliu showed that the BanyanDB liaison 
gRPC server is vulnerable to OOM errors under high-throughput write traffic. 
If a client sends data faster than the server can process and persist it, the 
gRPC library's internal buffers grow without bound and consume all available 
heap memory. Under heavy load, profiling frequently attributes this growth to 
`measure.Recv()`.
   
   To ensure server stability and prevent crashes, we need to introduce 
mechanisms that:
   
   1.  Actively shed load when the system is under high memory pressure.
   2.  Intelligently configure gRPC's network buffers to provide backpressure 
before the heap is exhausted.
   
   #### **Proposed Solution**
   
   This proposal outlines a two-pronged approach to controlling heap usage by 
integrating the existing `protector` service with the gRPC server's lifecycle 
and configuration.
   
   **1. Load Shedding via Protector State**
   
   We will implement a gRPC Stream Server Interceptor that queries the 
`protector`'s state before allowing a new stream to be handled.
   
     * **Dependency Injection:** The `liaison/grpc` server will need a 
reference to the `protector` service, which should be passed in during 
initialization.
     * **Interceptor Logic:**
         * For each incoming stream, the interceptor will check the current 
system state by calling `protector.State()`.
         * If `protector.State()` returns `StateHigh`, system memory usage has 
crossed the configured high-water mark.
         * In this `StateHigh` condition, the interceptor will immediately 
reject the new stream with a `codes.ResourceExhausted` gRPC status, providing 
clear, immediate backpressure to the client and signaling that the server is 
temporarily unable to accept new work.
         * If the state is `StateLow`, the stream will be processed normally.
   
   ```go
   // Load-shedding stream interceptor (sketch). Requires google.golang.org/grpc,
   // google.golang.org/grpc/codes, and google.golang.org/grpc/status.
   func (s *server) protectorLoadSheddingInterceptor(
       srv interface{},
       ss grpc.ServerStream,
       info *grpc.StreamServerInfo,
       handler grpc.StreamHandler,
   ) error {
       if s.protector.State() == protector.StateHigh {
           s.log.Warn().Msg("rejecting new stream due to high memory pressure")
           return status.Error(codes.ResourceExhausted, "server is busy, please retry later")
       }
       return handler(srv, ss)
   }
   ```
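
   As a wiring sketch (the struct shape and constructor below are illustrative, 
not a final API), the `protector` reference is injected when the server is 
constructed, and the interceptor is registered through the standard 
`grpc.StreamInterceptor` server option:

   ```go
   // Sketch only: field types and names are placeholders.
   type server struct {
       protector protector.Service // injected during initialization
       log       *logger.Logger
   }

   func (s *server) newGRPCServer(opts ...grpc.ServerOption) *grpc.Server {
       // Screen every new stream with the load-shedding interceptor.
       opts = append(opts, grpc.StreamInterceptor(s.protectorLoadSheddingInterceptor))
       return grpc.NewServer(opts...)
   }
   ```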
   
   **2. Dynamic gRPC Buffer Sizing Based on Available Memory**
   
   Instead of fixed, static buffer sizes, we will dynamically calculate the 
gRPC HTTP/2 flow-control windows at server startup, based on the available 
system memory reported by the `protector`.

     * **Startup Logic:** During the gRPC server's `Serve()` phase, query the 
system's available memory by calling the `protector`.
     * **Configuration:** Introduce a new configuration flag, e.g., 
`grpc.buffer.memory-ratio` (defaulting to `0.10` for 10%). This will determine 
what fraction of the *available* system memory should be allocated to gRPC's 
connection-level buffers.
     * **Heuristic for Window Calculation:**
         * `totalBufferSize = availableMemory * memoryRatio`
         * `InitialConnWindowSize = totalBufferSize * 2 / 3`
         * `InitialWindowSize = totalBufferSize * 1 / 3`
         * This 2:1 ratio ensures the connection-level buffer is larger than 
any single stream's buffer, which is a common and effective practice.
     * **Applying the Options:** The calculated values will be passed to 
`grpc.NewServer()` via the `grpc.InitialWindowSize()` and 
`grpc.InitialConnWindowSize()` server options; see the sketch after this list.
     * **Override Mechanism:** The existing static configuration flags for 
window sizes (`grpc.InitialWindowSize`, etc.) take precedence: if a user sets 
an explicit value, the dynamic calculation is skipped, which allows expert 
manual tuning.
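
   For example, with 3GB of available memory and the default `0.10` ratio, 
`totalBufferSize` is 300MB, yielding a ~200MB connection window and a ~100MB 
per-stream window. The sketch below shows the startup calculation; the 
`availableMemory` and `memoryRatio` parameters stand in for the protector's 
report and the new flag, and the clamping details are assumptions (grpc-go's 
window options take `int32` and ignore values below 64KB):

   ```go
   // Sketch: availableMemory would be reported by the protector; memoryRatio
   // comes from the proposed grpc.buffer.memory-ratio flag. Requires the
   // math and google.golang.org/grpc packages.
   func dynamicWindowOptions(availableMemory uint64, memoryRatio float64) []grpc.ServerOption {
       total := float64(availableMemory) * memoryRatio
       if total > math.MaxInt32 {
           total = math.MaxInt32 // the window options take int32
       }
       connWindow := int32(total * 2 / 3) // connection-level window
       streamWindow := int32(total / 3)   // per-stream window
       const minWindow = 64 * 1024 // grpc-go ignores window sizes below 64KB
       if connWindow < minWindow {
           connWindow = minWindow
       }
       if streamWindow < minWindow {
           streamWindow = minWindow
       }
       return []grpc.ServerOption{
           grpc.InitialWindowSize(streamWindow),
           grpc.InitialConnWindowSize(connWindow),
       }
   }
   ```

   The override check would simply bypass this function whenever a 
user-supplied window-size flag is present.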
   
   
   
   ### Use case
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a pull request to implement this on your own?
   
   - [ ] Yes I am willing to submit a pull request on my own!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

