poorbarcode commented on code in PR #24510: URL: https://github.com/apache/pulsar/pull/24510#discussion_r2241999274
##########
pip/pip-434.md:
##########
@@ -0,0 +1,78 @@
+# PIP-434: Expose Netty channel configuration WRITE_BUFFER_WATER_MARK to pulsar conf and pause receive requests when channel is unwritable
+
+# Background knowledge & Motivation
+
+As we discussed along the discussion: https://lists.apache.org/thread/6jfs02ovt13mnhn441txqy5m6knw6rr8
+
+> Problem Statement:
+> We've encountered a critical issue in our Apache Pulsar clusters where brokers experience Out-Of-Memory (OOM) errors and continuous restarts under specific load patterns. This occurs when Netty channel write buffers become full, leading to a buildup of unacknowledged responses in the broker's memory.
+
+> Background:
+> Our clusters are configured with numerous namespaces, each containing approximately 8,000 to 10,000 topics. Our consumer applications are quite large, with each consumer using a regular expression (regex) pattern to subscribe to all topics within a namespace.
+
+> The problem manifests particularly during consumer application restarts. When a consumer restarts, it issues a getTopicsOfNamespace request. Due to the sheer number of topics, the response size is extremely large. This massive response overwhelms the socket output buffer, causing it to fill up rapidly. Consequently, the broker's responses get backlogged in memory, eventually leading to the broker's OOM and subsequent restart loop.
+
+> Solution we got:
+> - Expose Netty channel configuration WRITE_BUFFER_WATER_MARK to pulsar conf
+> - Stops receive requests continuously once the Netty channel is unwritable, users can use the new config to control the threshold that limits the max bytes that are pending write.
+
+# Goals
+
+## In Scope
+- Expose Netty channel configuration WRITE_BUFFER_WATER_MARK to pulsar conf
+- Stops receive requests continuously once the Netty channel is unwritable, users can use the new config to control the threshold that limits the max bytes that are pending write.
+
+## Out of Scope
+
+- This proposal is not in order to add a broker level memory limitation, it only focuses on addressing the OOM caused by the accumulation of a large number of responses in memory due to the channel granularity being unwritable.
+
+# Detailed Design
+
+### Configuration
+
+```shell
+# It relates to configuration "WriteBufferHighWaterMark" of Netty Channel Config. If the number of bytes queued in the write buffer exceeds this value, channel writable state will start to return "false".
+pulsarChannelWriteBufferHighWaterMark=64k
+# It relates to configuration "WriteBufferLowWaterMark" of Netty Channel Config. If the number of bytes queued in the write buffer is smaller than this value, channel writable state will start to return "true".
+pulsarChannelWriteBufferLowWaterMark=32k
+# Once the writer buffer is full, the channel stops dealing with new requests until it changes to writable
+pulsarChannelPauseReceivingRequestsIfUnwritable=false
+```
+
+### CLI
+
+### Metrics
+| Name | Description | Attributes | Units|
+|------------------------------------------------------|---------------------------------------------------------------------------------------------|--------------| --- |
+| `pulsar_server_channel_write_buf_memory_used_bytes` | Counter. The number of replicators. | cluster | - |

Review Comment:
Thanks for mentioning this mistake, I have corrected it
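
For readers following the thread, the high/low water mark behaviour described in the quoted Configuration section can be sketched without Netty. This is only an illustrative model, not the PR's implementation: the class `WaterMarkGate` and its methods are hypothetical names, and treating `64k`/`32k` as `64 * 1024`/`32 * 1024` bytes is an assumption. It shows the hysteresis: the channel becomes unwritable once pending bytes exceed the high mark and becomes writable again only after they fall below the low mark.

```java
// Hypothetical, Netty-free model of the water mark hysteresis described in the PIP.
// None of these names come from Pulsar or Netty; they are for illustration only.
final class WaterMarkGate {
    private final long lowWaterMark;
    private final long highWaterMark;
    private long pendingBytes;       // bytes queued but not yet flushed to the socket
    private boolean writable = true; // mirrors the channel's "writable" state

    WaterMarkGate(long lowWaterMark, long highWaterMark) {
        if (lowWaterMark < 0 || highWaterMark < lowWaterMark) {
            throw new IllegalArgumentException("expected 0 <= low <= high");
        }
        this.lowWaterMark = lowWaterMark;
        this.highWaterMark = highWaterMark;
    }

    /** A response of the given size was queued for writing. */
    void enqueue(long bytes) {
        pendingBytes += bytes;
        if (pendingBytes > highWaterMark) {
            writable = false; // exceeded the high mark: report unwritable
        }
    }

    /** The given number of bytes was flushed to the socket. */
    void flushed(long bytes) {
        pendingBytes = Math.max(0, pendingBytes - bytes);
        if (pendingBytes < lowWaterMark) {
            writable = true; // dropped below the low mark: report writable again
        }
    }

    boolean isWritable() {
        return writable;
    }

    /**
     * Models the pulsarChannelPauseReceivingRequestsIfUnwritable switch:
     * when enabled, new requests are read only while the channel is writable.
     */
    boolean shouldReadRequests(boolean pauseIfUnwritable) {
        return !pauseIfUnwritable || writable;
    }
}
```

Between the two marks the state does not flip, which avoids rapid toggling when the pending byte count hovers near a single threshold.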