hansedong opened a new issue, #7078: URL: https://github.com/apache/apisix/issues/7078
### Current Behavior

I have encountered a problem where APISIX cannot communicate properly with etcd through the Admin API. There are 3 etcd nodes in my environment. When I start APISIX, everything is normal, and I can operate resources such as Routes and Upstreams through the APISIX Admin API. However, after a period of time (the exact interval is uncertain, generally a few hours), operations via the Admin API start to time out. At that point, the following errors appear in the APISIX log:

```
2022/05/19 10:03:03 [warn] 13654#13654: *6638791 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:08 [warn] 13650#13650: *6635827 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:09 [warn] 13650#13650: *6643178 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:15 [warn] 13650#13650: *6640270 [lua] health_check.lua:90: report_failure(): update endpoint: https://10.152.6.32:2379 to unhealthy, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:15 [warn] 13650#13650: *6640270 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:20 [warn] 13655#13655: *6644402 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "PUT /apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:20 [warn] 13650#13650: *6637242 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:27 [warn] 13650#13650: *6641630 [lua] health_check.lua:90: report_failure(): update endpoint: https://10.152.6.32:2379 to unhealthy, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:27 [warn] 13650#13650: *6641630 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:32 [warn] 13650#13650: *6645920 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "PUT /apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:33 [warn] 13654#13654: *6638791 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
```

> 10.152.6.32:2379 is an etcd node

At this point, requests I send to the APISIX Admin API also get stuck:

```
[root@knode10-132-14-202 operation]# curl -v "http://adminapi-apisix.xxx.com:9180/apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80" -H 'X-API-KEY: 3f38ce3d332sffc6418d9351085fb61d'
* About to connect() to 10.132.15.138 port 9180 (#0)
*   Trying 10.132.15.138...
* Connected to 10.132.15.138 (10.132.15.138) port 9180 (#0)
> GET /apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.132.15.138:9180
> Accept: */*
> X-API-KEY: 3f38ce3adf742fc6418d9351085fb61d
>
# Here it gets stuck, without any output
```

However, there is no problem operating etcd via etcdctl:

```shell
[root@knode10-152-6-32 operation]# ETCDCTL_API=3 /usr/local/bin/etcdctl --cacert=/etc/etcd/etcdSSL/ca.pem --cert=/etc/etcd/etcdSSL/clientssl/apisix/apisix.pem --key=/etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem --endpoints=https://10.152.6.32:2379 get --prefix /apisix/upstreams/ms-aos-xxxappg-gp-418241-80
ms-aos-xxxappg-gp-418241-80
{"update_time":1652925676,"labels":{"env-type":"staging","appcode":"xxxappg","platform":"bh","version":"10006289"},"create_time":1652754430,"type":"roundrobin","pass_host":"pass","nodes":{"10.131.110.124:80":10},"scheme":"http","hash_on":"vars","id":"ms-aos-xxxappg-gp-418241-80","desc":"upstream for appcode: xxxappg and version: 10006289","name":"ms-aos-xxxappg-gp-418241-80"}
```

Since this etcd node can be operated through etcdctl, from the cluster's perspective etcd is serving normally. I also tried watching the same etcd node with the etcd watch command, and that command executed normally too. So my guess is that something is wrong with APISIX's watch requests to etcd.
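One quick way to narrow this down is to check whether the timeouts always hit the same etcd endpoint, by tallying the health checker's `report_failure` lines per endpoint. This is just a troubleshooting sketch: the `error.log` below is a sample file built from two lines of the excerpt above; on a real gateway, point the `grep` at your actual APISIX error log path instead.

```shell
# Sample lines taken from the APISIX error.log excerpt above.
cat > error.log <<'EOF'
2022/05/19 10:03:15 [warn] 13650#13650: *6640270 [lua] health_check.lua:90: report_failure(): update endpoint: https://10.152.6.32:2379 to unhealthy, client: 10.132.14.202
2022/05/19 10:03:27 [warn] 13650#13650: *6641630 [lua] health_check.lua:90: report_failure(): update endpoint: https://10.152.6.32:2379 to unhealthy, client: 10.132.14.202
EOF

# Tally how often each etcd endpoint is marked unhealthy; if a single node
# dominates the count, the problem is likely specific to that node or its path.
grep -o 'update endpoint: https://[0-9.]*:[0-9]* to unhealthy' error.log \
  | sort | uniq -c | sort -rn
```

If all three endpoints show up roughly equally, the issue is more likely on the APISIX side (e.g. the long-lived watch connections) than on any one etcd node.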
Then, after I raised the etcd cluster's log level from `info` to `debug`, I saw the following logs in etcd:

```
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.156+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:456","msg":"failed to send watch control response to gRPC stream","error":"rpc error: code = Unavailable desc = transport is closing"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.158+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
```

When APISIX is not running, etcd does not produce the above error logs. After APISIX has been running for a while, the watch-related errors begin to appear, and their time range is consistent with the error log in APISIX's error.log.

My configuration file content is as follows:

```yaml
...
  enable_resolv_search_opt: true       # enable search option in resolv.conf
  ssl:
    enable: true
    listen:                            # APISIX listening port in https.
      - 443
      # - port: 9444
      #   enable_http2: true           # If not set, the default value is `false`.
      # - ip: 127.0.0.3                # Specific IP, If not set, the default value is `0.0.0.0`.
      #   port: 9445
      #   enable_http2: true
    enable_http2: true                 # Not recommend: This parameter should be set via the `listen`.
    # listen_port: 9443                # Not recommend: This parameter should be set via the `listen`.
    ssl_trusted_certificate: /etc/etcd/etcdSSL/ca.pem  # Specifies a file path with trusted CA certificates in the PEM format
...
etcd:
  host:                                # it's possible to define multiple etcd hosts addresses of the same etcd cluster.
    - "https://10.152.6.32:2379"       # multiple etcd address, if your etcd cluster enables TLS, please use https scheme,
    - "https://10.152.6.33:2379"       # e.g. https://127.0.0.1:2379.
    - "https://10.152.6.34:2379"
  prefix: /apisix                      # apisix configurations prefix
  timeout: 30                          # 30 seconds
  #resync_delay: 5                     # when sync failed and a rest is needed, resync after the configured seconds plus 50% random jitter
  #health_check_timeout: 10            # etcd retry the unhealthy nodes after the configured seconds
  #user: root                          # root username for etcd
  #password: 5tHkHhYkjr6cQY            # root password for etcd
  tls:
    # To enable etcd client certificate you need to build APISIX-OpenResty, see
    # https://apisix.apache.org/docs/apisix/how-to-build/#step-6-build-openresty-for-apache-apisix
    cert: /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem     # path of certificate used by the etcd client
    key: /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem  # path of key used by the etcd client
    verify: false                      # whether to verify the etcd endpoint certificate when setup a TLS connection to etcd,
                                       # the default value is true, e.g. the certificate will be verified strictly.
    sni: etcdserver-apisix-staging.mfwdev.com  # the SNI for etcd TLS requests. If missed, the host part of the URL will be used.
```

My etcd certificate is private, but I am sure the certificate itself is fine; otherwise APISIX would not have been able to communicate with etcd properly in the beginning. Also, while APISIX is having this communication problem with etcd, executing `apisix init_etcd` gets stuck without any output as well.

What I want to ask is: how should I troubleshoot this kind of problem, and where is the problem likely to be?

### Expected Behavior

_No response_

### Error Logs

apisix error log:

```
2022/05/19 10:03:03 [warn] 13654#13654: *6638791 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:08 [warn] 13650#13650: *6635827 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:09 [warn] 13650#13650: *6643178 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:15 [warn] 13650#13650: *6640270 [lua] health_check.lua:90: report_failure(): update endpoint: https://10.152.6.32:2379 to unhealthy, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:15 [warn] 13650#13650: *6640270 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:20 [warn] 13655#13655: *6644402 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "PUT /apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:20 [warn] 13650#13650: *6637242 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:27 [warn] 13650#13650: *6641630 [lua] health_check.lua:90: report_failure(): update endpoint: https://10.152.6.32:2379 to unhealthy, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:27 [warn] 13650#13650: *6641630 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:32 [warn] 13650#13650: *6645920 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "PUT /apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:33 [warn] 13654#13654: *6638791 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
```

etcd error log:

```
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.156+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:456","msg":"failed to send watch control response to gRPC stream","error":"rpc error: code = Unavailable desc = transport is closing"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.158+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
```

### Steps to Reproduce

I'm not sure how to reproduce this issue.

### Environment

- APISIX version (run `apisix version`): `2.13.1`
- Operating system (run `uname -a`): `Linux knode10-132-15-138 4.14.105-19-0023 #1 SMP Mon Jan 10 17:53:54 CST 2022 x86_64 x86_64 x86_64 GNU/Linux`
- OpenResty / Nginx version (run `openresty -V`
or `nginx -V`):

```
nginx version: openresty/1.19.9.1
built by gcc 9.3.1 20200408 (Red Hat 9.3.1-2) (GCC)
built with OpenSSL 1.1.1n  15 Mar 2022
TLS SNI support enabled
configure arguments: --prefix=/usr/local/openresty/nginx --with-cc-opt='-O2 -DAPISIX_BASE_VER=1.19.9.1.5 -DNGX_LUA_ABORT_AT_PANIC -I/usr/local/openresty/zlib/include -I/usr/local/openresty/pcre/include -I/usr/local/openresty/openssl111/include' --add-module=../ngx_devel_kit-0.3.1 --add-module=../echo-nginx-module-0.62 --add-module=../xss-nginx-module-0.06 --add-module=../ngx_coolkit-0.2 --add-module=../set-misc-nginx-module-0.32 --add-module=../form-input-nginx-module-0.12 --add-module=../encrypted-session-nginx-module-0.08 --add-module=../srcache-nginx-module-0.32 --add-module=../ngx_lua-0.10.20 --add-module=../ngx_lua_upstream-0.07 --add-module=../headers-more-nginx-module-0.33 --add-module=../array-var-nginx-module-0.05 --add-module=../memc-nginx-module-0.19 --add-module=../redis2-nginx-module-0.15 --add-module=../redis-nginx-module-0.3.7 --add-module=../ngx_stream_lua-0.0.10 --with-ld-opt='-Wl,-rpath,/usr/local/openresty/luajit/lib -Wl,-rpath,/usr/local/openresty/wasmtime-c-api/lib -L/usr/local/openresty/zlib/lib -L/usr/local/openresty/pcre/lib -L/usr/local/openresty/openssl111/lib -Wl,-rpath,/usr/local/openresty/zlib/lib:/usr/local/openresty/pcre/lib:/usr/local/openresty/openssl111/lib' --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../mod_dubbo --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../ngx_multi_upstream_module --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../apisix-nginx-module --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../apisix-nginx-module/src/stream --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../wasm-nginx-module --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../lua-var-nginx-module --with-poll_module --with-pcre-jit --with-stream --with-stream_ssl_module --with-stream_ssl_preread_module --with-http_v2_module --without-mail_pop3_module --without-mail_imap_module --without-mail_smtp_module --with-http_stub_status_module --with-http_realip_module --with-http_addition_module --with-http_auth_request_module --with-http_secure_link_module --with-http_random_index_module --with-http_gzip_static_module --with-http_sub_module --with-http_dav_module --with-http_flv_module --with-http_mp4_module --with-http_gunzip_module --with-threads --with-compat --with-stream --with-http_ssl_module
```

- etcd version, if relevant (run `curl http://127.0.0.1:9090/v1/server_info`): `{"etcdserver":"3.5.4","etcdcluster":"3.5.0"}`
- APISIX Dashboard version, if relevant:
- Plugin runner version, for issues related to plugin runners:
- LuaRocks version, for installation issues (run `luarocks --version`):
