hanzhenfang commented on issue #13474:
URL: https://github.com/apache/apisix/issues/13474#issuecomment-4670722718

   Hi, @Baoyuantop  I built a smaller mock-only reproduction for the 
intermittent `ai-proxy` timeout. It does not call any real LLM provider.
   
   The mock upstream randomly chooses one behavior for each new TCP connection: 
proxy to a local HTTPS OpenAI-compatible mock and return HTTP 200, or hang 
before TLS response so APISIX times out inside `httpc:connect()`.
   
   The `ai-proxy` route below sets `keepalive: false` intentionally, so every 
request opens a fresh upstream TLS connection. This makes the intermittent 
connect/TLS-handshake failure reproducible in a short local test. With 
keepalive enabled, one successful connection can be reused and hide later 
connect attempts.
   
   ## Files
   
   ### docker-compose.yaml
   
   ```yaml
   name: issue-13474
   
   services:
     apisix:
       image: ${APISIX_IMAGE:-apache/apisix:3.16.0-debian}
       container_name: issue-13474-apisix
       restart: "no"
       depends_on:
         - llm-random-tls
       environment:
         APISIX_STAND_ALONE: "true"
       volumes:
         - ./config.yaml:/usr/local/apisix/conf/config.yaml:ro
         - ./apisix.yaml:/usr/local/apisix/conf/apisix.yaml:ro
       ports:
         - "19080:9080"
         - "19090:9090"
   
     llm-random-tls:
       image: apache/apisix:3.16.0-debian
       container_name: issue-13474-llm-random-tls
       restart: "no"
       environment:
         MOCK_FAIL_RATE: "30"
       volumes:
         - ./mock-random-tls-nginx.conf:/tmp/mock-random-tls-nginx.conf:ro
       command:
         - sh
         - -c
         - >
           mkdir -p /tmp/random-tls/logs &&
           openssl req -x509 -newkey rsa:2048 -nodes -days 1
           -subj '/CN=llm-random-tls'
           -keyout /tmp/random-tls/server.key
           -out /tmp/random-tls/server.crt &&
           openresty -p /tmp/random-tls -c /tmp/mock-random-tls-nginx.conf -g 
'daemon off;'
   ```
   
   ### config.yaml
   
   ```yaml
   apisix:
     node_listen:
       - 9080
     enable_admin: false
   
   nginx_config:
     error_log_level: info
     worker_processes: 1
   
   deployment:
     role: data_plane
     role_data_plane:
       config_provider: yaml
   
   control:
     ip: 0.0.0.0
     port: 9090
   ```
   
   ### apisix.yaml
   
   ```yaml
   routes:
     - id: 1
       uri: /random-tls/v1/chat/completions
       methods:
         - POST
       plugins:
         ai-proxy:
           provider: openai-compatible
           timeout: 1000
           keepalive: false
           keepalive_timeout: 60000
           keepalive_pool: 30
           ssl_verify: false
           auth:
             header:
               Authorization: Bearer mock-random-tls-key
           options:
             model: random-tls-mock
           override:
             endpoint: https://llm-random-tls:9444/v1/chat/completions
   #END
   ```
   
   ### mock-random-tls-nginx.conf
   
   ```nginx
   env MOCK_FAIL_RATE;
   
   worker_processes 1;
   error_log /dev/stderr info;
   pid /tmp/random-tls/nginx.pid;
   
   events {
       worker_connections 1024;
   }
   
   stream {
       server {
           listen 9444;
   
           preread_by_lua_block {
               local fail_rate = tonumber(os.getenv("MOCK_FAIL_RATE") or "30") 
or 30
               local remote_port = tonumber(ngx.var.remote_port or "0") or 0
               local seed = math.floor(ngx.now() * 1000000) + ngx.worker.pid() 
+ remote_port
   
               math.randomseed(seed)
   
               local r = math.random(100)
               if r <= fail_rate then
                   ngx.log(ngx.WARN, "random TLS mock selected hang before TLS 
response, r=", r, ", fail_rate=", fail_rate)
                   ngx.sleep(60)
                   return ngx.exit(ngx.ERROR)
               end
   
               ngx.log(ngx.WARN, "random TLS mock selected normal TLS upstream, 
r=", r, ", fail_rate=", fail_rate)
           }
   
           proxy_pass 127.0.0.1:9445;
       }
   }
   
   http {
       access_log /dev/stdout;
   
       server {
           listen 9445 ssl;
           default_type application/json;
   
           ssl_certificate /tmp/random-tls/server.crt;
           ssl_certificate_key /tmp/random-tls/server.key;
   
           location = /v1/chat/completions {
               content_by_lua_block {
                   ngx.req.read_body()
   
                   local cjson = require("cjson.safe")
   
                   ngx.header["Content-Type"] = "application/json"
                   ngx.header["X-Mock-LLM"] = "random-tls"
   
                   local body = {
                       id = "chatcmpl-random-tls-mock",
                       object = "chat.completion",
                       created = ngx.time(),
                       model = "random-tls-mock",
                       choices = {
                           {
                               index = 0,
                               message = {
                                   role = "assistant",
                                   content = "mock response from random-tls",
                               },
                               finish_reason = "stop",
                           },
                       },
                       usage = {
                           prompt_tokens = 3,
                           completion_tokens = 4,
                           total_tokens = 7,
                       },
                   }
   
                   ngx.say(cjson.encode(body))
               }
           }
       }
   }
   ```
   
   ## Test steps
   
   1. Save the four files above in the same directory.
   2. Start APISIX 3.16.0: `docker compose up -d --force-recreate 
--remove-orphans && docker compose exec -T apisix apisix version`
   3. Send requests through APISIX:
   
   ```sh
   for i in $(seq 1 40); do
     curl -sS -o /dev/null -w '%{http_code} %{time_total}\n' -m 5 \
       http://127.0.0.1:19080/random-tls/v1/chat/completions \
       -H 'Content-Type: application/json' \
       -d '{"messages":[{"role":"user","content":"ping"}]}'
   done
   ```
   
   4. Check timeout logs: `docker compose logs --since 2m apisix | grep 
'openai-base.lua:366: failed to connect to LLM server: timeout'`
   
   ## Observed
   
   With APISIX `3.16.0`, a 40-request run produced mixed 200 and 504 responses. 
One observed run was `attempts=40 ok=33 timeout_504=7 other=0`.
   
   I also reran the exact comment version with 10 batches of 40 requests. The 
total result was `attempts=400 ok=288 timeout_504=112 other=0`, so the observed 
reproduction rate was about `28%` with `MOCK_FAIL_RATE: "30"`.
   
   The APISIX log count matched the timeout count:
   
   ```text
   openai-base.lua:366: failed to connect to LLM server: timeout -> 112
   ```
   
   The mock upstream log count also matched:
   
   ```text
   hang before TLS response -> 112
   normal TLS upstream     -> 288
   ```
   
   Representative APISIX log:
   
   ```text
   openai-base.lua:366: failed to connect to LLM server: timeout, request: 
"POST /random-tls/v1/chat/completions HTTP/1.1"
   ```
   
   I also tested the same mock against APISIX `3.14.1`. It also timed out with 
the same upstream behavior, but the log line was the old one:
   
   ```text
   attempts=40 ok=28 timeout_504=12 other=0
   openai-base.lua:229: failed to connect to LLM server: timeout -> 12
   ```
   
   So this reproduces the intermittent 3.16 log line from the issue with a 
simulated LLM upstream. It does not prove a 3.16-only regression by itself, 
because the same upstream TLS-handshake stall also times out in 3.14.1 with the 
old line number.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to