hanahmily opened a new issue, #13485: URL: https://github.com/apache/skywalking/issues/13485
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/skywalking/issues?q=is%3Aissue) and found no similar feature requirement.

### Description

#### **Motivation**

From a recent performance test led by @mrproliu, the BanyanDB liaison gRPC server is vulnerable to OOM errors when subjected to high-throughput write traffic. If a client sends data faster than the server can process and persist it, the gRPC library's internal buffers grow indefinitely, consuming all available heap memory. This is frequently observed when profiling `measure.Recv()` under heavy load.

To ensure server stability and prevent crashes, we need to introduce mechanisms that:

1. Actively shed load when the system is under high memory pressure.
2. Intelligently configure gRPC's network buffers to provide backpressure before the heap is exhausted.

#### **Proposed Solution**

This proposal outlines a two-pronged approach to controlling heap usage by integrating the existing `protector` service with the gRPC server's lifecycle and configuration.

**1. Load Shedding via Protector State**

We will implement a gRPC stream server interceptor that queries the `protector`'s state before allowing a new stream to be handled.

* **Dependency Injection:** The `liaison/grpc` server will need a reference to the `protector` service, passed in during initialization.
* **Interceptor Logic:**
  * For each new incoming stream, the interceptor checks the current system state by calling `protector.State()`.
  * If `protector.State()` returns `StateHigh`, system memory usage has crossed the configured high-water mark.
  * In the `StateHigh` condition, the interceptor immediately rejects the new stream with a `codes.ResourceExhausted` gRPC status. This provides clear, immediate backpressure to the client, signaling that the server is temporarily unable to accept new workloads.
  * If the state is `StateLow`, the stream is processed as normal.
```go
// Pseudocode for the load-shedding stream interceptor.
func (s *server) protectorLoadSheddingInterceptor(
	srv any,
	ss grpc.ServerStream,
	info *grpc.StreamServerInfo,
	handler grpc.StreamHandler,
) error {
	if s.protector.State() == protector.StateHigh {
		s.log.Warn().Msg("rejecting new stream due to high memory pressure")
		return status.Errorf(codes.ResourceExhausted, "server is busy, please retry later")
	}
	return handler(srv, ss)
}
```

**2. Dynamic gRPC Buffer Sizing Based on Available Memory**

Instead of using fixed, static buffer sizes, we will dynamically calculate the gRPC HTTP/2 flow-control windows at server startup, based on the available system memory reported by the `protector`.

* **Startup Logic:** During the `Serve()` phase of the gRPC server, query the system's available memory. This can be done by calling the protector.
* **Configuration:** Introduce a new configuration flag, e.g., `grpc.buffer.memory-ratio` (defaulting to `0.10`, i.e., 10%). This determines what fraction of the *available* system memory is allocated to gRPC's connection-level buffers.
* **Heuristic for Window Calculation:**
  * `totalBufferSize = availableMemory * memoryRatio`
  * `InitialConnWindowSize = totalBufferSize * 2 / 3`
  * `InitialWindowSize = totalBufferSize * 1 / 3`
  * This 2:1 ratio ensures the connection-level buffer is larger than any single stream's buffer, which is a common and effective practice.
* **Applying the Options:** The calculated values are passed to `grpc.NewServer()` using the `grpc.InitialWindowSize()` and `grpc.InitialConnWindowSize()` server options.
* **Override Mechanism:** The existing static configuration flags for window sizes (`grpc.InitialWindowSize`, etc.) take precedence. If a user sets a specific value, the dynamic calculation is skipped. This allows for expert manual tuning.

### Use case

_No response_

### Related issues

_No response_

### Are you willing to submit a pull request to implement this on your own?

- [ ] Yes I am willing to submit a pull request on my own!
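As a sketch of the window-size heuristic above (the helper name `computeWindowSizes` and its byte-denominated inputs are assumptions for illustration, not existing BanyanDB code):

```go
package main

import (
	"fmt"
	"math"
)

// computeWindowSizes derives the gRPC HTTP/2 flow-control windows from the
// available system memory, using the 2:1 connection-to-stream ratio from the
// proposal: the connection window gets 2/3 of the budget and each stream
// window gets 1/3. Results are clamped to the int32 range accepted by
// grpc.InitialWindowSize / grpc.InitialConnWindowSize.
func computeWindowSizes(availableMemory uint64, memoryRatio float64) (connWindow, streamWindow int32) {
	totalBufferSize := float64(availableMemory) * memoryRatio
	clamp := func(v float64) int32 {
		if v > math.MaxInt32 {
			return math.MaxInt32
		}
		return int32(v)
	}
	return clamp(totalBufferSize * 2 / 3), clamp(totalBufferSize / 3)
}

func main() {
	// Example: 3 GiB reported available by the protector, 10% ratio.
	conn, stream := computeWindowSizes(3<<30, 0.10)
	fmt.Printf("conn=%d stream=%d\n", conn, stream)
}
```

The resulting values would feed `grpc.InitialConnWindowSize(conn)` and `grpc.InitialWindowSize(stream)` when constructing the server, and the whole calculation would be skipped whenever the user has set the static window-size flags explicitly.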
### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)