hansedong opened a new issue, #7078: URL: https://github.com/apache/apisix/issues/7078
### Current Behavior

I have encountered a problem where APISIX cannot communicate properly with etcd through the Admin API. There are 3 etcd nodes in my environment. When I start APISIX, everything is normal, and I can operate resources such as Routes and Upstreams through the APISIX Admin API. However, after a period of time (the exact interval is uncertain, generally a few hours), operations via the Admin API start to time out. At that point, the following errors appear in the APISIX log:

```
2022/05/19 10:03:03 [warn] 13654#13654: *6638791 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:08 [warn] 13650#13650: *6635827 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:09 [warn] 13650#13650: *6643178 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:15 [warn] 13650#13650: *6640270 [lua] health_check.lua:90: report_failure(): update endpoint: https://10.152.6.32:2379 to unhealthy, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:15 [warn] 13650#13650: *6640270 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:20 [warn] 13655#13655: *6644402 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "PUT /apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:20 [warn] 13650#13650: *6637242 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:27 [warn] 13650#13650: *6641630 [lua] health_check.lua:90: report_failure(): update endpoint: https://10.152.6.32:2379 to unhealthy, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:27 [warn] 13650#13650: *6641630 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:32 [warn] 13650#13650: *6645920 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "PUT /apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:33 [warn] 13654#13654: *6638791 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
```

> 10.152.6.32:2379 is an etcd node

At this point, requests I send to the APISIX Admin API also get stuck:

```
[root@knode10-132-14-202 operation]# curl -v "http://adminapi-apisix.xxx.com:9180/apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80" -H 'X-API-KEY: 3f38ce3d332sffc6418d9351085fb61d'
* About to connect() to 10.132.15.138 port 9180 (#0)
*   Trying 10.132.15.138...
* Connected to 10.132.15.138 (10.132.15.138) port 9180 (#0)
> GET /apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.132.15.138:9180
> Accept: */*
> X-API-KEY: 3f38ce3adf742fc6418d9351085fb61d
>
# Here it gets stuck, without any output
```

However, there is no problem operating etcd via etcdctl:

```shell
[root@knode10-152-6-32 operation]# ETCDCTL_API=3 /usr/local/bin/etcdctl --cacert=/etc/etcd/etcdSSL/ca.pem --cert=/etc/etcd/etcdSSL/clientssl/apisix/apisix.pem --key=/etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem --endpoints=https://10.152.6.32:2379 get --prefix /apisix/upstreams/ms-aos-xxxappg-gp-418241-80
ms-aos-xxxappg-gp-418241-80
{"update_time":1652925676,"labels":{"env-type":"staging","appcode":"xxxappg","platform":"bh","version":"10006289"},"create_time":1652754430,"type":"roundrobin","pass_host":"pass","nodes":{"10.131.110.124:80":10},"scheme":"http","hash_on":"vars","id":"ms-aos-xxxappg-gp-418241-80","desc":"upstream for appcode: xxxappg and version: 10006289","name":"ms-aos-xxxappg-gp-418241-80"}
```

Since this etcd node can be operated through etcdctl, from the cluster's perspective etcd is serving normally. I also tried watching the same etcd node with the etcd watch command, and that command executed normally too. So my guess is that something is wrong with APISIX's watch requests to etcd.
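One quick way to narrow this down is to check whether the timeouts always hit the same etcd endpoint, by tallying the health checker's `report_failure` lines per endpoint. This is just a troubleshooting sketch: the `error.log` below is a sample file built from two lines of the excerpt above; on a real gateway, point the `grep` at your actual APISIX error log path instead.

```shell
# Sample lines taken from the APISIX error.log excerpt above.
cat > error.log <<'EOF'
2022/05/19 10:03:15 [warn] 13650#13650: *6640270 [lua] health_check.lua:90: report_failure(): update endpoint: https://10.152.6.32:2379 to unhealthy, client: 10.132.14.202
2022/05/19 10:03:27 [warn] 13650#13650: *6641630 [lua] health_check.lua:90: report_failure(): update endpoint: https://10.152.6.32:2379 to unhealthy, client: 10.132.14.202
EOF

# Tally how often each etcd endpoint is marked unhealthy; if a single node
# dominates the count, the problem is likely specific to that node or its path.
grep -o 'update endpoint: https://[0-9.]*:[0-9]* to unhealthy' error.log \
  | sort | uniq -c | sort -rn
```

If all three endpoints show up roughly equally, the issue is more likely on the APISIX side (e.g. the long-lived watch connections) than on any one etcd node.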
Then, after I raised the etcd cluster's log level from `info` to `debug`, I saw the following logs in etcd:

```
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.156+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:456","msg":"failed to send watch control response to gRPC stream","error":"rpc error: code = Unavailable desc = transport is closing"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.158+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
```

When APISIX is not running, etcd does not produce the above error logs. After APISIX has been running for a while, the watch-related errors begin to appear, and their time range is consistent with the error log in APISIX's error.log.

My configuration file content is as follows:

```yaml
...
  enable_resolv_search_opt: true       # enable search option in resolv.conf
  ssl:
    enable: true
    listen:                            # APISIX listening port in https.
      - 443
      # - port: 9444
      #   enable_http2: true           # If not set, the default value is `false`.
      # - ip: 127.0.0.3                # Specific IP, If not set, the default value is `0.0.0.0`.
      #   port: 9445
      #   enable_http2: true
    enable_http2: true                 # Not recommend: This parameter should be set via the `listen`.
    # listen_port: 9443                # Not recommend: This parameter should be set via the `listen`.
    ssl_trusted_certificate: /etc/etcd/etcdSSL/ca.pem  # Specifies a file path with trusted CA certificates in the PEM format
...
etcd:
  host:                                # it's possible to define multiple etcd hosts addresses of the same etcd cluster.
    - "https://10.152.6.32:2379"       # multiple etcd address, if your etcd cluster enables TLS, please use https scheme,
    - "https://10.152.6.33:2379"       # e.g. https://127.0.0.1:2379.
    - "https://10.152.6.34:2379"
  prefix: /apisix                      # apisix configurations prefix
  timeout: 30                          # 30 seconds
  #resync_delay: 5                     # when sync failed and a rest is needed, resync after the configured seconds plus 50% random jitter
  #health_check_timeout: 10            # etcd retry the unhealthy nodes after the configured seconds
  #user: root                          # root username for etcd
  #password: 5tHkHhYkjr6cQY            # root password for etcd
  tls:
    # To enable etcd client certificate you need to build APISIX-OpenResty, see
    # https://apisix.apache.org/docs/apisix/how-to-build/#step-6-build-openresty-for-apache-apisix
    cert: /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem     # path of certificate used by the etcd client
    key: /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem  # path of key used by the etcd client
    verify: false                      # whether to verify the etcd endpoint certificate when setup a TLS connection to etcd,
                                       # the default value is true, e.g. the certificate will be verified strictly.
    sni: etcdserver-apisix-staging.mfwdev.com  # the SNI for etcd TLS requests. If missed, the host part of the URL will be used.
```

My etcd certificate is private, but I am sure the certificate itself is fine; otherwise APISIX would not have been able to communicate with etcd properly in the beginning. Also, while APISIX is having this communication problem with etcd, executing `apisix init_etcd` gets stuck without any output as well.

What I want to ask is: how should I troubleshoot this kind of problem, and where is the problem likely to be?

### Expected Behavior

_No response_

### Error Logs

apisix error log:

```
2022/05/19 10:03:03 [warn] 13654#13654: *6638791 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:08 [warn] 13650#13650: *6635827 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:09 [warn] 13650#13650: *6643178 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:15 [warn] 13650#13650: *6640270 [lua] health_check.lua:90: report_failure(): update endpoint: https://10.152.6.32:2379 to unhealthy, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:15 [warn] 13650#13650: *6640270 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:20 [warn] 13655#13655: *6644402 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "PUT /apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:20 [warn] 13650#13650: *6637242 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:27 [warn] 13650#13650: *6641630 [lua] health_check.lua:90: report_failure(): update endpoint: https://10.152.6.32:2379 to unhealthy, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:27 [warn] 13650#13650: *6641630 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:32 [warn] 13650#13650: *6645920 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "PUT /apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:33 [warn] 13654#13654: *6638791 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
```

etcd error log:

```
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.156+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:456","msg":"failed to send watch control response to gRPC stream","error":"rpc error: code = Unavailable desc = transport is closing"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.158+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
```

### Steps to Reproduce

I'm not sure how to reproduce this issue.

### Environment

- APISIX version (run `apisix version`): `2.13.1`
- Operating system (run `uname -a`): `Linux knode10-132-15-138 4.14.105-19-0023 #1 SMP Mon Jan 10 17:53:54 CST 2022 x86_64 x86_64 x86_64 GNU/Linux`
- OpenResty / Nginx version (run `openresty -V`
or `nginx -V`):

```
nginx version: openresty/1.19.9.1
built by gcc 9.3.1 20200408 (Red Hat 9.3.1-2) (GCC)
built with OpenSSL 1.1.1n  15 Mar 2022
TLS SNI support enabled
configure arguments: --prefix=/usr/local/openresty/nginx --with-cc-opt='-O2 -DAPISIX_BASE_VER=1.19.9.1.5 -DNGX_LUA_ABORT_AT_PANIC -I/usr/local/openresty/zlib/include -I/usr/local/openresty/pcre/include -I/usr/local/openresty/openssl111/include' --add-module=../ngx_devel_kit-0.3.1 --add-module=../echo-nginx-module-0.62 --add-module=../xss-nginx-module-0.06 --add-module=../ngx_coolkit-0.2 --add-module=../set-misc-nginx-module-0.32 --add-module=../form-input-nginx-module-0.12 --add-module=../encrypted-session-nginx-module-0.08 --add-module=../srcache-nginx-module-0.32 --add-module=../ngx_lua-0.10.20 --add-module=../ngx_lua_upstream-0.07 --add-module=../headers-more-nginx-module-0.33 --add-module=../array-var-nginx-module-0.05 --add-module=../memc-nginx-module-0.19 --add-module=../redis2-nginx-module-0.15 --add-module=../redis-nginx-module-0.3.7 --add-module=../ngx_stream_lua-0.0.10 --with-ld-opt='-Wl,-rpath,/usr/local/openresty/luajit/lib -Wl,-rpath,/usr/local/openresty/wasmtime-c-api/lib -L/usr/local/openresty/zlib/lib -L/usr/local/openresty/pcre/lib -L/usr/local/openresty/openssl111/lib -Wl,-rpath,/usr/local/openresty/zlib/lib:/usr/local/openresty/pcre/lib:/usr/local/openresty/openssl111/lib' --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../mod_dubbo --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../ngx_multi_upstream_module --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../apisix-nginx-module --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../apisix-nginx-module/src/stream --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../wasm-nginx-module --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../lua-var-nginx-module --with-poll_module --with-pcre-jit --with-stream --with-stream_ssl_module --with-stream_ssl_preread_module --with-http_v2_module --without-mail_pop3_module --without-mail_imap_module --without-mail_smtp_module --with-http_stub_status_module --with-http_realip_module --with-http_addition_module --with-http_auth_request_module --with-http_secure_link_module --with-http_random_index_module --with-http_gzip_static_module --with-http_sub_module --with-http_dav_module --with-http_flv_module --with-http_mp4_module --with-http_gunzip_module --with-threads --with-compat --with-stream --with-http_ssl_module
```

- etcd version, if relevant (run `curl http://127.0.0.1:9090/v1/server_info`): `{"etcdserver":"3.5.4","etcdcluster":"3.5.0"}`
- APISIX Dashboard version, if relevant:
- Plugin runner version, for issues related to plugin runners:
- LuaRocks version, for installation issues (run `luarocks --version`):
