Yilialinn commented on code in PR #12094: URL: https://github.com/apache/apisix/pull/12094#discussion_r2017801597
########## docs/en/latest/plugins/ai-proxy-multi.md: ########## @@ -27,215 +29,977 @@ description: This document contains information about the Apache APISIX ai-proxy # --> -## Description +<head> + <link rel="canonical" href="https://docs.api7.ai/hub/ai-proxy-multi" /> +</head> -The `ai-proxy-multi` plugin simplifies access to LLM providers and models by defining a standard request format -that allows key fields in plugin configuration to be embedded into the request. +## Description -This plugin adds additional features like `load balancing` and `retries` to the existing `ai-proxy` plugin. +The `ai-proxy-multi` Plugin simplifies access to LLM and embedding models by transforming Plugin configurations into the designated request format for OpenAI, DeepSeek, and other OpenAI-compatible APIs. It extends the capabilities of [`ai-proxy-multi`](./ai-proxy.md) with load balancing, retries, fallbacks, and health checks. -Proxying requests to OpenAI is supported now. Other LLM services will be supported soon. +In addition, the Plugin also supports logging LLM request information in the access log, such as token usage, model, time to the first response, and more. ## Request Format -### OpenAI - -- Chat API - | Name | Type | Required | Description | | ------------------ | ------ | -------- | --------------------------------------------------- | -| `messages` | Array | Yes | An array of message objects | -| `messages.role` | String | Yes | Role of the message (`system`, `user`, `assistant`) | -| `messages.content` | String | Yes | Content of the message | - -## Plugin Attributes - -| **Name** | **Required** | **Type** | **Description** | **Default** | -| ---------------------------- | ------------ | -------- | ------------------------------------------------------------------------------------------------------------- | ----------- | -| providers | Yes | array | List of AI providers, each following the provider schema. | | -| provider.name | Yes | string | Name of the AI service provider. Allowed values: `openai`, `deepseek`. | | -| provider.model | Yes | string | Name of the AI model to execute. Example: `gpt-4o`. | | -| provider.priority | No | integer | Priority of the provider for load balancing. | 0 | -| provider.weight | No | integer | Load balancing weight. | | -| balancer.algorithm | No | string | Load balancing algorithm. Allowed values: `chash`, `roundrobin`. | roundrobin | -| balancer.hash_on | No | string | Defines what to hash on for consistent hashing (`vars`, `header`, `cookie`, `consumer`, `vars_combinations`). | vars | -| balancer.key | No | string | Key for consistent hashing in dynamic load balancing. | | -| provider.auth | Yes | object | Authentication details, including headers and query parameters. | | -| provider.auth.header | No | object | Authentication details sent via headers. Header name must match `^[a-zA-Z0-9._-]+$`. | | -| provider.auth.query | No | object | Authentication details sent via query parameters. Keys must match `^[a-zA-Z0-9._-]+$`. | | -| provider.override.endpoint | No | string | Custom host override for the AI provider. | | -| timeout | No | integer | Request timeout in milliseconds (1-60000). | 30000 | -| keepalive | No | boolean | Enables keepalive connections. | true | -| keepalive_timeout | No | integer | Timeout for keepalive connections (minimum 1000ms). | 60000 | -| keepalive_pool | No | integer | Maximum keepalive connections. | 30 | -| ssl_verify | No | boolean | Enables SSL certificate verification. 
| true | - -## Example usage - -Create a route with the `ai-proxy-multi` plugin like so: +| `messages` | Array | True | An array of message objects. | +| `messages.role` | String | True | Role of the message (`system`, `user`, `assistant`).| +| `messages.content` | String | True | Content of the message. | + +## Attributes + +| Name | Type | Required | Default | Valid Values | Description | +|------------------------------------|----------------|----------|-----------------------------------|--------------|-------------| +| fallback_strategy | string | False | instance_health_and_rate_limiting | instance_health_and_rate_limiting | Fallback strategy. When set, the Plugin will check whether the specified instance’s token has been exhausted when a request is forwarded. If so, forward the request to the next instance regardless of the instance priority. When not set, the Plugin will not forward the request to low priority instances when token of the high priority instance is exhausted. | +| balancer | object | False | | | Load balancing configurations. | +| balancer.algorithm | string | False | roundrobin | [roundrobin, chash] | Load balancing algorithm. When set to `roundrobin`, weighted round robin algorithm is used. When set to `chash`, consistent hashing algorithm is used. | +| balancer.hash_on | string | False | | [vars, headers, cookie, consumer, vars_combinations] | Used when `type` is `chash`. Support hashing on [NGINX variables](https://nginx.org/en/docs/varindex.html), headers, cookie, consumer, or a combination of [NGINX variables](https://nginx.org/en/docs/varindex.html). | +| balancer.key | string | False | | | Used when `type` is `chash`. When `hash_on` is set to `header` or `cookie`, `key` is required. When `hash_on` is set to `consumer`, `key` is not required as the consumer name will be used as the key automatically. | +| instances | array[object] | True | | | LLM instance configurations. | +| instances.name | string | True | | | Name of the LLM service instance. | +| instances.provider | string | True | | [openai, deepseek, openai-compatible] | LLM service provider. When set to `openai`, the Plugin will proxy the request to `api.openai.com`. When set to `deepseek`, the Plugin will proxy the request to `api.deepseek.com`. When set to `openai-compatible`, the Plugin will proxy the request to the custom endpoint configured in `override`. | +| instances.priority | integer | False | 0 | | Priority of the LLM instance in load balancing. `priority` takes precedence over `weight`. | +| instances.weight | string | True | 0 | greater or equal to 0 | Weight of the LLM instance in load balancing. | +| instances.auth | object | True | | | Authentication configurations. | +| instances.auth.header | object | False | | | Authentication headers. At least one of the `header` and `query` should be configured. | +| instances.auth.query | object | False | | | Authentication query parameters. At least one of the `header` and `query` should be configured. | +| instances.options | object | False | | | Model configurations. In addition to `model`, you can configure additional parameters and they will be forwarded to the upstream LLM service in the request body. For instance, if you are working with OpenAI or DeepSeek, you can configure additional parameters such as `max_tokens`, `temperature`, `top_p`, and `stream`. See your LLM provider's API documentation for more available options. | +| instances.options.model | string | False | | | Name of the LLM model, such as `gpt-4` or `gpt-3.5`. 
See your LLM provider's API documentation for more available models. | +| logging | object | False | | | Logging configurations. | +| logging.summaries | boolean | False | false | | If true, log request LLM model, duration, request, and response tokens. | +| logging.payloads | boolean | False | false | | If true, log request and response payload. | +| logging.override | object | False | | | Override setting. | +| logging.override.endpoint | string | False | | | LLM provider endpoint to replace the default endpoint with. If not configured, the Plugin uses the default OpenAI endpoint `https://api.openai.com/v1/chat/completions`. | +| checks | object | False | | | Health check configurations. Note that at the moment, OpenAI and DeepSeek do not provide an official health check endpoint. Other LLM services that you can configure under `openai-compatible` provider may have available health check endpoints. | +| checks.active | object | True | | | Active health check configurations. | +| checks.active.type | string | False | http | [http, https, tcp] | Type of health check connection. | +| checks.active.timeout | number | False | 1 | | Health check timeout in seconds. | +| checks.active.concurrency | integer | False | 10 | | Number of upstream nodes to be checked at the same time. | +| checks.active.host | string | False | | | HTTP host. | +| checks.active.port | integer | False | | between 1 and 65535 inclusive | HTTP port. | +| checks.active.http_path | string | False | / | | Path for HTTP probing requests. | +| checks.active.https_verify_certificate | boolean | False | true | | If true, verify the node's TLS certificate. | +| timeout | integer | False | 30000 | greater than or equal to 1 | Request timeout in milliseconds when requesting the LLM service. | +| keepalive | boolean | False | true | | If true, keep the connection alive when requesting the LLM service. | +| keepalive_timeout | integer | False | 60000 | greater than or equal to 1000 | Request timeout in milliseconds when requesting the LLM service. | +| keepalive_pool | integer | False | 30 | | Keepalive pool size for when connecting with the LLM service. | +| ssl_verify | boolean | False | true | | If true, verify the LLM service's certificate. | + +## Examples + +The examples below demonstrate how you can configure `ai-proxy-multi` for different scenarios. + +:::note + +You can fetch the `admin_key` from `config.yaml` and save to an environment variable with the following command: + +```bash +admin_key=$(yq '.deployment.admin.admin_key[0].key' conf/config.yaml | sed 's/"//g') +``` + +::: + +### Load Balance between Instances + +The following example demonstrates how you can configure two models for load balancing, forwarding 80% of the traffic to one instance and 20% to the other. + +For demonstration and easier differentiation, you will be configuring one OpenAI instance and one DeepSeek instance as the upstream LLM services. 
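The Route configuration below reads the provider keys from environment variables. As a minimal sketch (the key values shown are placeholders, not real credentials), you can export them before creating the Route:

```shell
# Placeholders — substitute your own OpenAI and DeepSeek API keys
export OPENAI_API_KEY=sk-your-openai-key
export DEEPSEEK_API_KEY=sk-your-deepseek-key
```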
+ +Create a Route as such and update with your LLM providers, models, API keys, and endpoints if applicable: ```shell curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ - -H "X-API-KEY: ${ADMIN_API_KEY}" \ + -H "X-API-KEY: ${admin_key}" \ -d '{ "id": "ai-proxy-multi-route", "uri": "/anything", "methods": ["POST"], "plugins": { "ai-proxy-multi": { - "providers": [ + "instances": [ { - "name": "openai", - "model": "gpt-4", - "weight": 1, - "priority": 1, + "name": "openai-instance", + "provider": "openai", + "weight": 8, "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } }, "options": { - "max_tokens": 512, - "temperature": 1.0 + "model": "gpt-4" } }, { - "name": "deepseek", - "model": "deepseek-chat", - "weight": 1, + "name": "deepseek-instance", + "provider": "deepseek", + "weight": 2, "auth": { "header": { "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" } }, "options": { - "max_tokens": 512, - "temperature": 1.0 + "model": "deepseek-chat" } } ] } - }, - "upstream": { - "type": "roundrobin", - "nodes": { - "httpbin.org": 1 - } } }' ``` -In the above configuration, requests will be equally balanced among the `openai` and `deepseek` providers. +Send 10 POST requests to the Route with a system prompt and a sample user question in the request body, to see the number of requests forwarded to OpenAI and DeepSeek: -### Retry and fallback: +```shell +openai_count=0 +deepseek_count=0 + +for i in {1..10}; do + model=$(curl -s "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "What is 1+1?" } + ] + }' | jq -r '.model') + + if [[ "$model" == *"gpt-4"* ]]; then + ((openai_count++)) + elif [[ "$model" == "deepseek-chat" ]]; then + ((deepseek_count++)) + fi +done + +echo "OpenAI responses: $openai_count" +echo "DeepSeek responses: $deepseek_count" +``` + +You should see a response similar to the following: + +```text +OpenAI responses: 8 +DeepSeek responses: 2 +``` + +### Configure Instance Priority and Rate Limiting + +The following example demonstrates how you can configure two models with different priorities and apply rate limiting on the instance with a higher priority. In the case where `fallback_strategy` is set to `instance_health_and_rate_limiting`, the Plugin should continue to forward requests to the low priority instance once the high priority instance's rate limiting quota is fully consumed. -The `priority` attribute can be adjusted to implement the fallback and retry feature. 
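Once the Route below is created, you can tell which instance served a given request by reading the `model` field of the response, as in the previous example. A minimal check (assuming `jq` is installed):

```shell
# Send a sample request and print only the reported model name
curl -s "http://127.0.0.1:9080/anything" -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "What is 1+1?" }
    ]
  }' | jq -r '.model'
```

A `gpt-4`-prefixed value indicates the high priority OpenAI instance, while `deepseek-chat` indicates the lower priority DeepSeek instance.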
+Create a Route as such and update with your LLM providers, models, API keys, and endpoints if applicable: ```shell curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ - -H "X-API-KEY: ${ADMIN_API_KEY}" \ + -H "X-API-KEY: ${admin_key}" \ -d '{ "id": "ai-proxy-multi-route", "uri": "/anything", "methods": ["POST"], "plugins": { "ai-proxy-multi": { - "providers": [ + "fallback_strategy: "instance_health_and_rate_limiting", + "instances": [ { - "name": "openai", - "model": "gpt-4", - "weight": 1, + "name": "openai-instance", + "provider": "openai", "priority": 1, + "weight": 0, "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } }, "options": { - "max_tokens": 512, - "temperature": 1.0 + "model": "gpt-4" } }, { - "name": "deepseek", - "model": "deepseek-chat", - "weight": 1, + "name": "deepseek-instance", + "provider": "deepseek", "priority": 0, + "weight": 0, "auth": { "header": { "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" } }, "options": { - "max_tokens": 512, - "temperature": 1.0 + "model": "deepseek-chat" } } ] + }, + "ai-rate-limiting": { + "instances": [ + { + "name": "openai-instance", + "limit": 10, + "time_window": 60 + } + ], + "limit_strategy": "total_tokens" } + } + }' +``` + +Send a POST request to the Route with a system prompt and a sample user question in the request body: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "What is 1+1?" } + ] + }' +``` + +You should receive a response similar to the following: + +```json +{ + ..., + "model": "gpt-4-0613", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "1+1 equals 2.", + "refusal": null + }, + "logprobs": null, + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 23, + "completion_tokens": 8, + "total_tokens": 31, + "prompt_tokens_details": { + "cached_tokens": 0, + "audio_tokens": 0 }, - "upstream": { - "type": "roundrobin", - "nodes": { - "httpbin.org": 1 + "completion_tokens_details": { + "reasoning_tokens": 0, + "audio_tokens": 0, + "accepted_prediction_tokens": 0, + "rejected_prediction_tokens": 0 + } + }, + "service_tier": "default", + "system_fingerprint": null +} +``` + +Since the `total_tokens` value exceeds the configured quota of `10`, the next request within the 60-second window is expected to be forwarded to the other instance. + +Within the same 60-second window, send another POST request to the route: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "Explain Newton law" } + ] + }' +``` + +You should see a response similar to the following: + +```json +{ + ..., + "model": "deepseek-chat", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "Certainly! Newton's laws of motion are three fundamental principles that describe the relationship between the motion of an object and the forces acting on it. They were formulated by Sir Isaac Newton in the late 17th century and are foundational to classical mechanics.\n\n---\n\n### **1. 
Newton's First Law (Law of Inertia):**\n- **Statement:** An object at rest will remain at rest, and an object in motion will continue moving at a constant velocity (in a straight line at a constant speed), unless acted upon by an external force.\n- **Key Idea:** This law introduces the concept of **inertia**, which is the tendency of an object to resist changes in its state of motion.\n- **Example:** If you slide a book across a table, it eventually stops because of the force of friction acting on it. Without friction, the book would keep moving indefinitely.\n\n---\n\n### **2. Newton's Second Law (Law of Acceleration):**\n- **Statement:** The acceleration of an object is direct ly proportional to the net force acting on it and inversely proportional to its mass. Mathematically, this is expressed as:\n \\[\n F = ma\n \\]\n where:\n - \\( F \\) = net force applied (in Newtons),\n -" + }, + ... + } + ], + ... +} +``` + +### Load Balance and Rate Limit by Consumers + +The following example demonstrates how you can configure two models for load balancing and apply rate limiting by consumer. + +Create a consumer `johndoe` and a rate limiting quota of 10 tokens in a 60-second window on `openai-instance` instance: Review Comment: ```suggestion Create a Consumer `johndoe` and a rate limiting quota of 10 tokens in a 60-second window on `openai-instance` instance: ``` ########## docs/en/latest/plugins/ai-proxy-multi.md: ########## @@ -27,215 +29,977 @@ description: This document contains information about the Apache APISIX ai-proxy # --> -## Description +<head> + <link rel="canonical" href="https://docs.api7.ai/hub/ai-proxy-multi" /> +</head> -The `ai-proxy-multi` plugin simplifies access to LLM providers and models by defining a standard request format -that allows key fields in plugin configuration to be embedded into the request. +## Description -This plugin adds additional features like `load balancing` and `retries` to the existing `ai-proxy` plugin. +The `ai-proxy-multi` Plugin simplifies access to LLM and embedding models by transforming Plugin configurations into the designated request format for OpenAI, DeepSeek, and other OpenAI-compatible APIs. It extends the capabilities of [`ai-proxy-multi`](./ai-proxy.md) with load balancing, retries, fallbacks, and health checks. -Proxying requests to OpenAI is supported now. Other LLM services will be supported soon. +In addition, the Plugin also supports logging LLM request information in the access log, such as token usage, model, time to the first response, and more. ## Request Format -### OpenAI - -- Chat API - | Name | Type | Required | Description | | ------------------ | ------ | -------- | --------------------------------------------------- | -| `messages` | Array | Yes | An array of message objects | -| `messages.role` | String | Yes | Role of the message (`system`, `user`, `assistant`) | -| `messages.content` | String | Yes | Content of the message | - -## Plugin Attributes - -| **Name** | **Required** | **Type** | **Description** | **Default** | -| ---------------------------- | ------------ | -------- | ------------------------------------------------------------------------------------------------------------- | ----------- | -| providers | Yes | array | List of AI providers, each following the provider schema. | | -| provider.name | Yes | string | Name of the AI service provider. Allowed values: `openai`, `deepseek`. | | -| provider.model | Yes | string | Name of the AI model to execute. Example: `gpt-4o`. 
| | -| provider.priority | No | integer | Priority of the provider for load balancing. | 0 | -| provider.weight | No | integer | Load balancing weight. | | -| balancer.algorithm | No | string | Load balancing algorithm. Allowed values: `chash`, `roundrobin`. | roundrobin | -| balancer.hash_on | No | string | Defines what to hash on for consistent hashing (`vars`, `header`, `cookie`, `consumer`, `vars_combinations`). | vars | -| balancer.key | No | string | Key for consistent hashing in dynamic load balancing. | | -| provider.auth | Yes | object | Authentication details, including headers and query parameters. | | -| provider.auth.header | No | object | Authentication details sent via headers. Header name must match `^[a-zA-Z0-9._-]+$`. | | -| provider.auth.query | No | object | Authentication details sent via query parameters. Keys must match `^[a-zA-Z0-9._-]+$`. | | -| provider.override.endpoint | No | string | Custom host override for the AI provider. | | -| timeout | No | integer | Request timeout in milliseconds (1-60000). | 30000 | -| keepalive | No | boolean | Enables keepalive connections. | true | -| keepalive_timeout | No | integer | Timeout for keepalive connections (minimum 1000ms). | 60000 | -| keepalive_pool | No | integer | Maximum keepalive connections. | 30 | -| ssl_verify | No | boolean | Enables SSL certificate verification. | true | - -## Example usage - -Create a route with the `ai-proxy-multi` plugin like so: +| `messages` | Array | True | An array of message objects. | +| `messages.role` | String | True | Role of the message (`system`, `user`, `assistant`).| +| `messages.content` | String | True | Content of the message. | + +## Attributes + +| Name | Type | Required | Default | Valid Values | Description | +|------------------------------------|----------------|----------|-----------------------------------|--------------|-------------| +| fallback_strategy | string | False | instance_health_and_rate_limiting | instance_health_and_rate_limiting | Fallback strategy. When set, the Plugin will check whether the specified instance’s token has been exhausted when a request is forwarded. If so, forward the request to the next instance regardless of the instance priority. When not set, the Plugin will not forward the request to low priority instances when token of the high priority instance is exhausted. | +| balancer | object | False | | | Load balancing configurations. | +| balancer.algorithm | string | False | roundrobin | [roundrobin, chash] | Load balancing algorithm. When set to `roundrobin`, weighted round robin algorithm is used. When set to `chash`, consistent hashing algorithm is used. | +| balancer.hash_on | string | False | | [vars, headers, cookie, consumer, vars_combinations] | Used when `type` is `chash`. Support hashing on [NGINX variables](https://nginx.org/en/docs/varindex.html), headers, cookie, consumer, or a combination of [NGINX variables](https://nginx.org/en/docs/varindex.html). | +| balancer.key | string | False | | | Used when `type` is `chash`. When `hash_on` is set to `header` or `cookie`, `key` is required. When `hash_on` is set to `consumer`, `key` is not required as the consumer name will be used as the key automatically. | +| instances | array[object] | True | | | LLM instance configurations. | +| instances.name | string | True | | | Name of the LLM service instance. | +| instances.provider | string | True | | [openai, deepseek, openai-compatible] | LLM service provider. 
When set to `openai`, the Plugin will proxy the request to `api.openai.com`. When set to `deepseek`, the Plugin will proxy the request to `api.deepseek.com`. When set to `openai-compatible`, the Plugin will proxy the request to the custom endpoint configured in `override`. | +| instances.priority | integer | False | 0 | | Priority of the LLM instance in load balancing. `priority` takes precedence over `weight`. | +| instances.weight | string | True | 0 | greater or equal to 0 | Weight of the LLM instance in load balancing. | +| instances.auth | object | True | | | Authentication configurations. | +| instances.auth.header | object | False | | | Authentication headers. At least one of the `header` and `query` should be configured. | +| instances.auth.query | object | False | | | Authentication query parameters. At least one of the `header` and `query` should be configured. | +| instances.options | object | False | | | Model configurations. In addition to `model`, you can configure additional parameters and they will be forwarded to the upstream LLM service in the request body. For instance, if you are working with OpenAI or DeepSeek, you can configure additional parameters such as `max_tokens`, `temperature`, `top_p`, and `stream`. See your LLM provider's API documentation for more available options. | +| instances.options.model | string | False | | | Name of the LLM model, such as `gpt-4` or `gpt-3.5`. See your LLM provider's API documentation for more available models. | +| logging | object | False | | | Logging configurations. | +| logging.summaries | boolean | False | false | | If true, log request LLM model, duration, request, and response tokens. | +| logging.payloads | boolean | False | false | | If true, log request and response payload. | +| logging.override | object | False | | | Override setting. | +| logging.override.endpoint | string | False | | | LLM provider endpoint to replace the default endpoint with. If not configured, the Plugin uses the default OpenAI endpoint `https://api.openai.com/v1/chat/completions`. | +| checks | object | False | | | Health check configurations. Note that at the moment, OpenAI and DeepSeek do not provide an official health check endpoint. Other LLM services that you can configure under `openai-compatible` provider may have available health check endpoints. | +| checks.active | object | True | | | Active health check configurations. | +| checks.active.type | string | False | http | [http, https, tcp] | Type of health check connection. | +| checks.active.timeout | number | False | 1 | | Health check timeout in seconds. | +| checks.active.concurrency | integer | False | 10 | | Number of upstream nodes to be checked at the same time. | +| checks.active.host | string | False | | | HTTP host. | +| checks.active.port | integer | False | | between 1 and 65535 inclusive | HTTP port. | +| checks.active.http_path | string | False | / | | Path for HTTP probing requests. | +| checks.active.https_verify_certificate | boolean | False | true | | If true, verify the node's TLS certificate. | +| timeout | integer | False | 30000 | greater than or equal to 1 | Request timeout in milliseconds when requesting the LLM service. | +| keepalive | boolean | False | true | | If true, keep the connection alive when requesting the LLM service. | +| keepalive_timeout | integer | False | 60000 | greater than or equal to 1000 | Request timeout in milliseconds when requesting the LLM service. 
| +| keepalive_pool | integer | False | 30 | | Keepalive pool size for when connecting with the LLM service. | +| ssl_verify | boolean | False | true | | If true, verify the LLM service's certificate. | + +## Examples + +The examples below demonstrate how you can configure `ai-proxy-multi` for different scenarios. + +:::note + +You can fetch the `admin_key` from `config.yaml` and save to an environment variable with the following command: + +```bash +admin_key=$(yq '.deployment.admin.admin_key[0].key' conf/config.yaml | sed 's/"//g') +``` + +::: + +### Load Balance between Instances + +The following example demonstrates how you can configure two models for load balancing, forwarding 80% of the traffic to one instance and 20% to the other. + +For demonstration and easier differentiation, you will be configuring one OpenAI instance and one DeepSeek instance as the upstream LLM services. + +Create a Route as such and update with your LLM providers, models, API keys, and endpoints if applicable: ```shell curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ - -H "X-API-KEY: ${ADMIN_API_KEY}" \ + -H "X-API-KEY: ${admin_key}" \ -d '{ "id": "ai-proxy-multi-route", "uri": "/anything", "methods": ["POST"], "plugins": { "ai-proxy-multi": { - "providers": [ + "instances": [ { - "name": "openai", - "model": "gpt-4", - "weight": 1, - "priority": 1, + "name": "openai-instance", + "provider": "openai", + "weight": 8, "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } }, "options": { - "max_tokens": 512, - "temperature": 1.0 + "model": "gpt-4" } }, { - "name": "deepseek", - "model": "deepseek-chat", - "weight": 1, + "name": "deepseek-instance", + "provider": "deepseek", + "weight": 2, "auth": { "header": { "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" } }, "options": { - "max_tokens": 512, - "temperature": 1.0 + "model": "deepseek-chat" } } ] } - }, - "upstream": { - "type": "roundrobin", - "nodes": { - "httpbin.org": 1 - } } }' ``` -In the above configuration, requests will be equally balanced among the `openai` and `deepseek` providers. +Send 10 POST requests to the Route with a system prompt and a sample user question in the request body, to see the number of requests forwarded to OpenAI and DeepSeek: -### Retry and fallback: +```shell +openai_count=0 +deepseek_count=0 + +for i in {1..10}; do + model=$(curl -s "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "What is 1+1?" } + ] + }' | jq -r '.model') + + if [[ "$model" == *"gpt-4"* ]]; then + ((openai_count++)) + elif [[ "$model" == "deepseek-chat" ]]; then + ((deepseek_count++)) + fi +done + +echo "OpenAI responses: $openai_count" +echo "DeepSeek responses: $deepseek_count" +``` + +You should see a response similar to the following: + +```text +OpenAI responses: 8 +DeepSeek responses: 2 +``` + +### Configure Instance Priority and Rate Limiting + +The following example demonstrates how you can configure two models with different priorities and apply rate limiting on the instance with a higher priority. In the case where `fallback_strategy` is set to `instance_health_and_rate_limiting`, the Plugin should continue to forward requests to the low priority instance once the high priority instance's rate limiting quota is fully consumed. -The `priority` attribute can be adjusted to implement the fallback and retry feature. 
+Create a Route as such and update with your LLM providers, models, API keys, and endpoints if applicable: ```shell curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ - -H "X-API-KEY: ${ADMIN_API_KEY}" \ + -H "X-API-KEY: ${admin_key}" \ -d '{ "id": "ai-proxy-multi-route", "uri": "/anything", "methods": ["POST"], "plugins": { "ai-proxy-multi": { - "providers": [ + "fallback_strategy: "instance_health_and_rate_limiting", + "instances": [ { - "name": "openai", - "model": "gpt-4", - "weight": 1, + "name": "openai-instance", + "provider": "openai", "priority": 1, + "weight": 0, "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } }, "options": { - "max_tokens": 512, - "temperature": 1.0 + "model": "gpt-4" } }, { - "name": "deepseek", - "model": "deepseek-chat", - "weight": 1, + "name": "deepseek-instance", + "provider": "deepseek", "priority": 0, + "weight": 0, "auth": { "header": { "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" } }, "options": { - "max_tokens": 512, - "temperature": 1.0 + "model": "deepseek-chat" } } ] + }, + "ai-rate-limiting": { + "instances": [ + { + "name": "openai-instance", + "limit": 10, + "time_window": 60 + } + ], + "limit_strategy": "total_tokens" } + } + }' +``` + +Send a POST request to the Route with a system prompt and a sample user question in the request body: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "What is 1+1?" } + ] + }' +``` + +You should receive a response similar to the following: + +```json +{ + ..., + "model": "gpt-4-0613", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "1+1 equals 2.", + "refusal": null + }, + "logprobs": null, + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 23, + "completion_tokens": 8, + "total_tokens": 31, + "prompt_tokens_details": { + "cached_tokens": 0, + "audio_tokens": 0 }, - "upstream": { - "type": "roundrobin", - "nodes": { - "httpbin.org": 1 + "completion_tokens_details": { + "reasoning_tokens": 0, + "audio_tokens": 0, + "accepted_prediction_tokens": 0, + "rejected_prediction_tokens": 0 + } + }, + "service_tier": "default", + "system_fingerprint": null +} +``` + +Since the `total_tokens` value exceeds the configured quota of `10`, the next request within the 60-second window is expected to be forwarded to the other instance. + +Within the same 60-second window, send another POST request to the route: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "Explain Newton law" } + ] + }' +``` + +You should see a response similar to the following: + +```json +{ + ..., + "model": "deepseek-chat", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "Certainly! Newton's laws of motion are three fundamental principles that describe the relationship between the motion of an object and the forces acting on it. They were formulated by Sir Isaac Newton in the late 17th century and are foundational to classical mechanics.\n\n---\n\n### **1. 
Newton's First Law (Law of Inertia):**\n- **Statement:** An object at rest will remain at rest, and an object in motion will continue moving at a constant velocity (in a straight line at a constant speed), unless acted upon by an external force.\n- **Key Idea:** This law introduces the concept of **inertia**, which is the tendency of an object to resist changes in its state of motion.\n- **Example:** If you slide a book across a table, it eventually stops because of the force of friction acting on it. Without friction, the book would keep moving indefinitely.\n\n---\n\n### **2. Newton's Second Law (Law of Acceleration):**\n- **Statement:** The acceleration of an object is direct ly proportional to the net force acting on it and inversely proportional to its mass. Mathematically, this is expressed as:\n \\[\n F = ma\n \\]\n where:\n - \\( F \\) = net force applied (in Newtons),\n -" + }, + ... + } + ], + ... +} +``` + +### Load Balance and Rate Limit by Consumers + +The following example demonstrates how you can configure two models for load balancing and apply rate limiting by consumer. + +Create a consumer `johndoe` and a rate limiting quota of 10 tokens in a 60-second window on `openai-instance` instance: + +```shell +curl "http://127.0.0.1:9180/apisix/admin/consumers" -X PUT \ + -H "X-API-KEY: ${admin_key}" \ + -d '{ + "username": "johndoe", + "plugins": { + "ai-rate-limiting": { + "instances": [ + { + "name": "openai-instance", + "limit": 10, + "time_window": 60 + } + ], + "rejected_code": 429, + "limit_strategy": "total_tokens" + } + } + }' +``` + +Configure `key-auth` credential for `johndoe`: + +```shell +curl "http://127.0.0.1:9180/apisix/admin/consumers/johndoe/credentials" -X PUT \ + -H "X-API-KEY: ${admin_key}" \ + -d '{ + "id": "cred-john-key-auth", + "plugins": { + "key-auth": { + "key": "john-key" } } }' ``` -In the above configuration `priority` for the deepseek provider is set to `0`. Which means if `openai` provider is unavailable then `ai-proxy-multi` plugin will retry sending request to `deepseek` in the second attempt. +Create another consumer `janedoe` and a rate limiting quota of 10 tokens in a 60-second window on `deepseek-instance` instance: Review Comment: ```suggestion Create another Consumer `janedoe` and a rate limiting quota of 10 tokens in a 60-second window on `deepseek-instance` instance: ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@apisix.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org