ZanderXu commented on PR #8448: URL: https://github.com/apache/hadoop/pull/8448#issuecomment-4384741219
> > Hi @kokonguyen191 @hfutatzhanghb @ZanderXu I had the impression that DFS_ROUTER_ASYNC_RPC_MAX_ASYNCCALL_PERMIT_KEY could control the maximum number of requests. Is it possible to achieve the same effect using this?
>
> I have the same question with @KeeProMise . @kokonguyen191 Could you please clarify here? IIUC, When we use acquirePermit, Router have put this rpc call into its call queue. So `DFS_ROUTER_ASYNC_RPC_MAX_ASYNCCALL_PERMIT_KEY` can not control the max calls at router side?

Thanks @KeeProMise @hfutatzhanghb for your review. Let me try to explain my understanding of this issue.

1. On the RBF side, there are two relevant thread groups: the 8888 handlers and the NS handlers inside the NS thread pool.
2. The 8888 handlers receive client requests and submit them to the corresponding NS thread pool. These requests are put into the thread pool's unbounded queue. The NS handlers then take requests from the queue and process them.
3. This is basically a producer-consumer model. The 8888 handlers are the producers, and the NS handlers are the consumers. If the producers are faster than the consumers, the unbounded queue may keep growing and eventually exhaust memory. Once that happens, the whole RBF instance may become unavailable.

Unfortunately, there are several cases where the consumers can become slower than the producers:

1. When NS permits are exhausted, NS handlers may have to wait for a permit, up to DFS_ROUTER_FAIRNESS_ACQUIRE_TIMEOUT.
2. Even after an NS handler gets a permit, it may still get blocked while sending the request to the downstream NN. For example, the NN may be slow or unavailable because of GC, a large delete, a large rename, HA failover, or a machine failure.

In these cases, NS handlers cannot consume requests fast enough, but the 8888 handlers may continue accepting and enqueueing new requests. This can cause the unbounded queue to grow continuously and eventually trigger OOM.
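To make the producer-consumer argument concrete, here is a minimal standalone sketch. It is not the Router code: the class `QueueGrowthSketch`, the method `submitWhileConsumerBlocked`, and the single-worker pool are hypothetical and only stand in for an NS thread pool whose handler is stuck on a slow NN. It compares an unbounded `LinkedBlockingQueue` with a bounded one while the only consumer is blocked.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class QueueGrowthSketch {

  /**
   * Submits `calls` tasks to a one-worker pool whose worker is blocked,
   * mimicking an NS handler waiting on a slow downstream NN.
   * queueCapacity <= 0 means an unbounded queue.
   * Returns {tasks left in the queue, tasks rejected}.
   */
  static int[] submitWhileConsumerBlocked(int calls, int queueCapacity) {
    LinkedBlockingQueue<Runnable> queue = queueCapacity > 0
        ? new LinkedBlockingQueue<>(queueCapacity)
        : new LinkedBlockingQueue<>();
    // Hypothetical stand-in for an NS thread pool with a single handler.
    ThreadPoolExecutor nsPool =
        new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS, queue);
    CountDownLatch downstream = new CountDownLatch(1);
    int rejected = 0;

    // Producer side: keeps submitting although the consumer makes no progress.
    for (int i = 0; i < calls; i++) {
      try {
        nsPool.execute(() -> {
          try {
            downstream.await(); // blocked "call" to the downstream NN
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      } catch (RejectedExecutionException e) {
        rejected++; // a full bounded queue pushes back on the producer
      }
    }

    int queued = queue.size();
    downstream.countDown(); // let the "NN" recover so the pool can drain
    nsPool.shutdown();
    try {
      nsPool.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return new int[] {queued, rejected};
  }

  public static void main(String[] args) {
    int[] unbounded = submitWhileConsumerBlocked(1000, 0);
    int[] bounded = submitWhileConsumerBlocked(1000, 10);
    System.out.println("unbounded: queued=" + unbounded[0] + " rejected=" + unbounded[1]);
    System.out.println("bounded:   queued=" + bounded[0] + " rejected=" + bounded[1]);
  }
}
```

In this sketch the first task occupies the only worker, so the unbounded run leaves every other task in the queue and rejects none, while the bounded run holds at most 10 queued tasks and rejects the rest. Rejection is only one possible form of back-pressure and is used here for illustration; it is not necessarily what this PR implements.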
