Heya team,

We hit some production trouble with clients sending very large multi-gets. Even with otherwise reasonable cell- and row-size limits, with maximum multi-action sizes in place, and with QoS and our fancy IO-based Quotas, the pressure was enough to push over a region server or three. It got me thinking that we need some kind of pressure gauge in the RPC layer that can protect the RS. This wouldn't be a QoS or Quota kind of feature; it's not about fairness between tenants. Rather, it's a safety mechanism, a kind of pressure valve. I wonder if something like this already exists, or maybe you know of a ticket already filed with some existing discussion.
My napkin sketch is something like a metric that tracks the amount of heap consumed by active request and response objects. When the metric hits a limit, we start rejecting new requests with a retryable exception. I don't know if we want the overhead of tracking this value exactly, so maybe the value is bumped only by new requests and then decays by some crude mechanism. Does Netty already have something like this? I'd say this is in lieu of an actual streaming RPC harness, but I think even a streaming system would benefit from such a backpressure strategy. It also occurs to me that I don't know the current state of active memory tracking in the region server. I recall there was some work to make the memstore and blockcache resize dynamically; maybe this new system adds a 3rd component to that computation.

Thoughts? Ideas?

Thanks,
Nick
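P.S. To make the napkin sketch concrete, here's roughly the shape I have in mind. All names here are illustrative, not an existing HBase or Netty API; this is the "exact tracking" variant (admit on arrival, release on response completion) rather than the crude-decay one:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical "pressure valve" for the RPC layer: track the heap held by
// in-flight request/response objects and shed new work past a limit.
final class RpcPressureGauge {
    private final long limitBytes;
    private final AtomicLong inFlightBytes = new AtomicLong();

    RpcPressureGauge(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /**
     * Try to admit a request of the given estimated size. Returns false when
     * over the limit, in which case the caller would fail fast with a
     * retryable exception so clients back off and try again.
     */
    boolean tryAdmit(long estimatedBytes) {
        while (true) {
            long current = inFlightBytes.get();
            if (current + estimatedBytes > limitBytes) {
                return false; // over the valve threshold; reject, don't queue
            }
            if (inFlightBytes.compareAndSet(current, current + estimatedBytes)) {
                return true;
            }
        }
    }

    /** Release when the response is written (or via decay, in the crude variant). */
    void release(long estimatedBytes) {
        inFlightBytes.addAndGet(-estimatedBytes);
    }

    /** Current gauge value, for surfacing as a metric. */
    long inFlightBytes() {
        return inFlightBytes.get();
    }
}
```

The crude-decay variant would drop the release() call from the response path and instead have a background chore periodically shrink inFlightBytes, trading accuracy for zero bookkeeping on the hot path.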