Copilot commented on code in PR #12968:
URL: https://github.com/apache/apisix/pull/12968#discussion_r2802820380


##########
apisix/plugins/ai-proxy-multi.lua:
##########
@@ -35,7 +35,7 @@ local endpoint_regex = "^(https?)://([^:/]+):?(%d*)/?.*$"
 
 local pickers = {}

Review Comment:
   The TTL reduction from 300 to 10 seconds significantly increases SHM access 
frequency. Consider documenting why this aggressive TTL is needed, especially 
since status_ver changes should already invalidate the cache when health states 
change.
   ```suggestion
   local pickers = {}
   -- NOTE:
   -- The TTL here is intentionally kept small (10s) instead of the more 
typical 300s.
   -- Health state changes are already handled via status_ver-based 
invalidation in the
   -- healthcheck manager, but this cache also needs to reflect other dynamic 
changes
   -- (for example, configuration updates, priority/weight adjustments, or 
backend
   -- endpoint rotation) in a timely manner for AI traffic routing.
   --
   -- Using a 10s TTL trades slightly higher SHM access frequency for faster 
convergence
   -- towards the latest upstream selection state, which is acceptable for this 
plugin's
   -- usage pattern. If you change this value, consider the impact on both 
performance
   -- and how quickly picker state reflects configuration and routing changes.
   ```



##########
apisix/plugins/ai-proxy-multi.lua:
##########
@@ -229,32 +225,75 @@ local function get_checkers_status_ver(checkers)
 end
 
 
+-- Read health status directly from SHM to get consistent state across all 
workers
+-- This bypasses the local cache which can be stale due to async worker events
+local function fetch_health_status_from_shm(shm, checker_name, ip, port, 
hostname, instance_name)
+    local lookup_hostname = hostname or ip
+    local state_key = string.format("lua-resty-healthcheck:%s:state:%s:%s:%s",
+        checker_name,
+        ip,
+        port,
+        lookup_hostname)
+
+    core.log.info("[SHM-DIRECT] instance=", instance_name, " key=", state_key)
+    local state = shm:get(state_key)
+    if state then
+        -- State values: 1=healthy, 2=unhealthy, 3=mostly_healthy, 
4=mostly_unhealthy
+        local ok = (state == 1 or state == 3)
+        core.log.info("[SHM-DIRECT] instance=", instance_name, " state=", 
state, " ok=", ok)
+        return ok, state
+    end
+
+    -- State not found in SHM (checker not yet created), default to healthy
+    core.log.warn("[SHM-DIRECT] state not found for instance=", instance_name, 
", defaulting to healthy")
+    return true, nil

Review Comment:
   Defaulting to healthy when state is not found contradicts the stated goal of 
excluding unhealthy instances. This could allow requests to uninitialized 
instances. Consider defaulting to unhealthy until the first health check 
completes, or document why healthy is the correct default.
   ```suggestion
       -- State not found in SHM (checker not yet created), default to 
unhealthy to avoid routing to uninitialized instances
       core.log.warn("[SHM-DIRECT] state not found for instance=", 
instance_name, ", defaulting to unhealthy")
       return false, nil
   ```



##########
apisix/plugins/ai-proxy-multi.lua:
##########
@@ -302,23 +341,30 @@ end
 function _M.construct_upstream(instance)
     local upstream = {}
     local node = instance._dns_value
+
+    -- If _dns_value doesn't exist (e.g., when called from timer),
+    -- calculate it from the instance config
     if not node then
-        return nil, "failed to resolve endpoint for instance: " .. 
instance.name
+        core.log.info("instance._dns_value not found, calculating from config 
for instance: ", instance.name)
+        node = calculate_dns_node(instance)
+        if not node then
+            return nil, "failed to calculate endpoint for instance: " .. 
instance.name
+        end
     end
 
     if not node.host or not node.port then
         return nil, "invalid upstream node: " .. core.json.encode(node)
     end
 
-    local node = {
+    local upstream_node = {

Review Comment:
   The variable name `upstream_node` shadows the outer scope variable `node`. 
While functionally correct, this could be confusing. Consider renaming to 
something like `node_config` to distinguish the constructed upstream node from 
the DNS-resolved node.



##########
apisix/plugins/ai-proxy-multi.lua:
##########
@@ -177,7 +170,9 @@ local function parse_domain_for_node(node)
 end
 
 
-local function resolve_endpoint(instance_conf)
+-- Calculate DNS node from instance config without modifying the input
+-- Returns a node table with host, port, scheme fields

Review Comment:
   The function comment should clarify that this is intended for use when 
_dns_value is not available (e.g., when called from timer context), as this is 
a critical use case mentioned in the PR description.
   ```suggestion
   -- Calculate DNS node from instance config without modifying the input.
   -- Intended for use when _dns_value is not available (e.g., when called
   -- from timer context) to recompute the target node.
   -- Returns a node table with host, port, scheme fields.
   ```



##########
apisix/plugins/ai-proxy-multi.lua:
##########
@@ -229,32 +225,75 @@ local function get_checkers_status_ver(checkers)
 end
 
 
+-- Read health status directly from SHM to get consistent state across all 
workers
+-- This bypasses the local cache which can be stale due to async worker events
+local function fetch_health_status_from_shm(shm, checker_name, ip, port, 
hostname, instance_name)
+    local lookup_hostname = hostname or ip
+    local state_key = string.format("lua-resty-healthcheck:%s:state:%s:%s:%s",
+        checker_name,
+        ip,
+        port,
+        lookup_hostname)
+
+    core.log.info("[SHM-DIRECT] instance=", instance_name, " key=", state_key)
+    local state = shm:get(state_key)
+    if state then
+        -- State values: 1=healthy, 2=unhealthy, 3=mostly_healthy, 
4=mostly_unhealthy
+        local ok = (state == 1 or state == 3)
+        core.log.info("[SHM-DIRECT] instance=", instance_name, " state=", 
state, " ok=", ok)
+        return ok, state
+    end
+
+    -- State not found in SHM (checker not yet created), default to healthy
+    core.log.warn("[SHM-DIRECT] state not found for instance=", instance_name, 
", defaulting to healthy")
+    return true, nil
+end
+
+
+-- Get SHM and checker_name for an instance, returns (shm, checker_name)
+local function get_shm_info(checkers, conf, i, ins)
+    local checker = checkers and checkers[ins.name]
+
+    -- Prefer the instance's own checker
+    if checker and checker.shm then
+        core.log.info("[SHM-DEBUG] instance=", ins.name, " using own checker")
+        return checker.shm, checker.name
+    end
+
+    -- Fallback: use another checker's SHM and construct the checker_name
+    local checker_ref = checkers and next(checkers)

Review Comment:
   `next(checkers)` returns (key, value), so `checker_ref` is actually the key 
(instance name), not the checker object. This code attempts to access 
`checker_ref.shm` which would fail. Should be: `local _, checker_ref = 
next(checkers)`
   ```suggestion
       local _, checker_ref = checkers and next(checkers)
   ```



##########
apisix/plugins/ai-proxy-multi.lua:
##########
@@ -229,32 +225,75 @@ local function get_checkers_status_ver(checkers)
 end
 
 
+-- Read health status directly from SHM to get consistent state across all 
workers
+-- This bypasses the local cache which can be stale due to async worker events
+local function fetch_health_status_from_shm(shm, checker_name, ip, port, 
hostname, instance_name)
+    local lookup_hostname = hostname or ip
+    local state_key = string.format("lua-resty-healthcheck:%s:state:%s:%s:%s",
+        checker_name,
+        ip,
+        port,
+        lookup_hostname)
+
+    core.log.info("[SHM-DIRECT] instance=", instance_name, " key=", state_key)
+    local state = shm:get(state_key)
+    if state then
+        -- State values: 1=healthy, 2=unhealthy, 3=mostly_healthy, 
4=mostly_unhealthy
+        local ok = (state == 1 or state == 3)
+        core.log.info("[SHM-DIRECT] instance=", instance_name, " state=", 
state, " ok=", ok)
+        return ok, state
+    end
+
+    -- State not found in SHM (checker not yet created), default to healthy
+    core.log.warn("[SHM-DIRECT] state not found for instance=", instance_name, 
", defaulting to healthy")
+    return true, nil
+end
+
+
+-- Get SHM and checker_name for an instance, returns (shm, checker_name)
+local function get_shm_info(checkers, conf, i, ins)
+    local checker = checkers and checkers[ins.name]
+
+    -- Prefer the instance's own checker
+    if checker and checker.shm then
+        core.log.info("[SHM-DEBUG] instance=", ins.name, " using own checker")
+        return checker.shm, checker.name
+    end
+
+    -- Fallback: use another checker's SHM and construct the checker_name
+    local checker_ref = checkers and next(checkers)
+    if checker_ref and checker_ref.shm then

Review Comment:
   The array index calculation `(i - 1)` suggests Lua's 1-based indexing is 
being converted to 0-based. Add a comment explaining this is for compatibility 
with the checker naming convention, as it's not immediately obvious why the 
adjustment is needed.
   ```suggestion
       if checker_ref and checker_ref.shm then
           -- Note: (i - 1) converts Lua's 1-based index to 0-based to match the
           -- existing checker naming convention used when the health checker 
is created.
   ```



##########
apisix/plugins/ai-proxy-multi.lua:
##########
@@ -188,22 +183,23 @@ local function resolve_endpoint(instance_conf)
         port = tonumber(port)
     else
         local ai_driver = require("apisix.plugins.ai-drivers." .. 
instance_conf.provider)
-        if ai_driver.get_node then
-            local node = ai_driver.get_node(instance_conf)
-            host = node.host
-            port = node.port
-        else
-            host = ai_driver.host
-            port = ai_driver.port
-        end
+        -- built-in ai driver always use https
         scheme = "https"
+        host = ai_driver.host
+        port = ai_driver.port

Review Comment:
   The removed code handled `ai_driver.get_node()` which appears to be a valid 
driver interface. This removal could break custom AI drivers that implement 
get_node(). Consider preserving this logic or documenting why it was removed.
   ```suggestion
           -- built-in ai driver always use https; custom drivers may implement 
get_node()
           if ai_driver.get_node then
               local driver_node = ai_driver.get_node(instance_conf)
               if driver_node then
                   scheme = driver_node.scheme or "https"
                   host = driver_node.host
                   port = driver_node.port
               else
                   scheme = "https"
                   host = ai_driver.host
                   port = ai_driver.port
               end
           else
               scheme = "https"
               host = ai_driver.host
               port = ai_driver.port
           end
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to