tokers commented on code in PR #7906:
URL: https://github.com/apache/apisix/pull/7906#discussion_r969400166

##########
docs/en/latest/FAQ.md:
##########
@@ -626,6 +626,59 @@ This method only detects whether the APISIX data plane is alive or not. It does
 :::
+## What are the scenarios with high APISIX latency related to Etcd and how to fix them?
+
+Etcd is a component of APISIX service discovery and data storage, and its stability is related to the stability of APISIX.
+
+In actual scenarios, if APISIX uses a certificate to connect to Etcd through HTTPS, the following two problems of high latency for data query or writing may occur:
+
+1. Query or write data through APISIX Admin API.
+2. In the monitoring scenario, Prometheus crawls the APISIX data plane Metrics API timeout.
+
+These problems related to higher latency seriously affect the service stability of APISIX, and the reason why such problems occur is mainly because Etcd provides two modes of operation: HTTP (HTTPS) and gRPC. And APISIX uses the HTTP (HTTPS) protocol to operate Etcd.
+In this scenario, Etcd has a bug about HTTP2: if Etcd is operated over HTTPS (HTTP is not affected), the upper limit of HTTP2 connections is the default 250 in Golang. Therefore, when the number of APISIX data plane nodes is large, once the number of connections between all APISIX nodes and Etcd exceeds this upper limit, the response of APISIX API interface will be very slow.

Review Comment:
```suggestion
In this scenario, ETCD has a bug about HTTP/2: if ETCD is operated over HTTPS (HTTP is not affected), the upper limit of HTTP2 connections is the default `250` in Golang. Therefore, when the number of APISIX data plane nodes is large, once the number of connections between all APISIX nodes and Etcd exceeds this upper limit, the response of APISIX API interface will be very slow.
```

##########
docs/en/latest/FAQ.md:
##########
@@ -626,6 +626,59 @@ This method only detects whether the APISIX data plane is alive or not. It does
 :::
+## What are the scenarios with high APISIX latency related to Etcd and how to fix them?

Review Comment:
```suggestion
## What are the scenarios with high APISIX latency related to ETCD and how to fix them?
```

##########
docs/en/latest/FAQ.md:
##########
@@ -626,6 +626,59 @@ This method only detects whether the APISIX data plane is alive or not. It does
 :::
+## What are the scenarios with high APISIX latency related to Etcd and how to fix them?
+
+Etcd is a component of APISIX service discovery and data storage, and its stability is related to the stability of APISIX.
+
+In actual scenarios, if APISIX uses a certificate to connect to Etcd through HTTPS, the following two problems of high latency for data query or writing may occur:
+
+1. Query or write data through APISIX Admin API.
+2. In the monitoring scenario, Prometheus crawls the APISIX data plane Metrics API timeout.
+
+These problems related to higher latency seriously affect the service stability of APISIX, and the reason why such problems occur is mainly because Etcd provides two modes of operation: HTTP (HTTPS) and gRPC. And APISIX uses the HTTP (HTTPS) protocol to operate Etcd.
+In this scenario, Etcd has a bug about HTTP2: if Etcd is operated over HTTPS (HTTP is not affected), the upper limit of HTTP2 connections is the default 250 in Golang. Therefore, when the number of APISIX data plane nodes is large, once the number of connections between all APISIX nodes and Etcd exceeds this upper limit, the response of APISIX API interface will be very slow.
+
+In Golang, the default upper limit of HTTP2 connections is 250, the code is as follows:
+
+```go
+package http2
+
+import ...
+
+const (
+    prefaceTimeout = 10 * time.Second
+    firstSettingsTimeout = 2 * time.Second // should be in-flight with preface anyway
+    handlerChunkWriteSize = 4 << 10
+    defaultMaxStreams = 250 // TODO: make this 100 as the GFE seems to?
+    maxQueuedControlFrames = 10000
+)
+
+```
+
+At present, Etcd officially maintains two main branches, 3.4 and 3.5.

Review Comment:
```suggestion
At present, Etcd officially maintains two main branches, `3.4` and `3.5`.
```

##########
docs/en/latest/FAQ.md:
##########
@@ -626,6 +626,59 @@ This method only detects whether the APISIX data plane is alive or not. It does
 :::
+## What are the scenarios with high APISIX latency related to Etcd and how to fix them?
+
+Etcd is a component of APISIX service discovery and data storage, and its stability is related to the stability of APISIX.

Review Comment:
Currently, APISIX doesn't have an ETCD service discovery module.

##########
docs/en/latest/FAQ.md:
##########
@@ -626,6 +626,59 @@ This method only detects whether the APISIX data plane is alive or not. It does
 :::
+## What are the scenarios with high APISIX latency related to Etcd and how to fix them?
+
+Etcd is a component of APISIX service discovery and data storage, and its stability is related to the stability of APISIX.
+
+In actual scenarios, if APISIX uses a certificate to connect to Etcd through HTTPS, the following two problems of high latency for data query or writing may occur:
+
+1. Query or write data through APISIX Admin API.
+2. In the monitoring scenario, Prometheus crawls the APISIX data plane Metrics API timeout.
+
+These problems related to higher latency seriously affect the service stability of APISIX, and the reason why such problems occur is mainly because Etcd provides two modes of operation: HTTP (HTTPS) and gRPC. And APISIX uses the HTTP (HTTPS) protocol to operate Etcd.
+In this scenario, Etcd has a bug about HTTP2: if Etcd is operated over HTTPS (HTTP is not affected), the upper limit of HTTP2 connections is the default 250 in Golang. Therefore, when the number of APISIX data plane nodes is large, once the number of connections between all APISIX nodes and Etcd exceeds this upper limit, the response of APISIX API interface will be very slow.
+
+In Golang, the default upper limit of HTTP2 connections is 250, the code is as follows:
+
+```go
+package http2
+
+import ...
+
+const (
+    prefaceTimeout = 10 * time.Second
+    firstSettingsTimeout = 2 * time.Second // should be in-flight with preface anyway
+    handlerChunkWriteSize = 4 << 10
+    defaultMaxStreams = 250 // TODO: make this 100 as the GFE seems to?
+    maxQueuedControlFrames = 10000
+)
+
+```
+
+At present, Etcd officially maintains two main branches, 3.4 and 3.5.
+The 3.4 branch has the recently released 3.4.20 which fixes this issue.
+As for the 3.5 branch, in fact, the official is preparing to release the 3.5.5 version a long time ago, but it has not been released so far. So, if you are using a version of Etcd less than 3.5.5, there are several ways to solve this problem:
+
+1. Change the communication method between APISIX and Etcd from HTTPS to HTTP (not recommended).
+2. Fallback version to 3.4.20 (not recommended).
+3. Clone the Etcd source code and compile the release-3.5 branch directly (this branch has fixed the problem of HTTP2 connections, but the new version has not been released yet). This method is recommended.

Review Comment:
Changing the ETCD source code is also not a recommended way, since users may deploy ETCD via an image.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
