Yilialinn commented on code in PR #12094: URL: https://github.com/apache/apisix/pull/12094#discussion_r2017801597
########## docs/en/latest/plugins/ai-proxy-multi.md: ########## @@ -27,215 +29,977 @@ description: This document contains information about the Apache APISIX ai-proxy # --> -## Description +<head> + <link rel="canonical" href="https://docs.api7.ai/hub/ai-proxy-multi" /> +</head> -The `ai-proxy-multi` plugin simplifies access to LLM providers and models by defining a standard request format -that allows key fields in plugin configuration to be embedded into the request. +## Description -This plugin adds additional features like `load balancing` and `retries` to the existing `ai-proxy` plugin. +The `ai-proxy-multi` Plugin simplifies access to LLM and embedding models by transforming Plugin configurations into the designated request format for OpenAI, DeepSeek, and other OpenAI-compatible APIs. It extends the capabilities of [`ai-proxy-multi`](./ai-proxy.md) with load balancing, retries, fallbacks, and health checks. -Proxying requests to OpenAI is supported now. Other LLM services will be supported soon. +In addition, the Plugin also supports logging LLM request information in the access log, such as token usage, model, time to the first response, and more. ## Request Format -### OpenAI - -- Chat API - | Name | Type | Required | Description | | ------------------ | ------ | -------- | --------------------------------------------------- | -| `messages` | Array | Yes | An array of message objects | -| `messages.role` | String | Yes | Role of the message (`system`, `user`, `assistant`) | -| `messages.content` | String | Yes | Content of the message | - -## Plugin Attributes - -| **Name** | **Required** | **Type** | **Description** | **Default** | -| ---------------------------- | ------------ | -------- | ------------------------------------------------------------------------------------------------------------- | ----------- | -| providers | Yes | array | List of AI providers, each following the provider schema. | | -| provider.name | Yes | string | Name of the AI service provider. Allowed values: `openai`, `deepseek`. | | -| provider.model | Yes | string | Name of the AI model to execute. Example: `gpt-4o`. | | -| provider.priority | No | integer | Priority of the provider for load balancing. | 0 | -| provider.weight | No | integer | Load balancing weight. | | -| balancer.algorithm | No | string | Load balancing algorithm. Allowed values: `chash`, `roundrobin`. | roundrobin | -| balancer.hash_on | No | string | Defines what to hash on for consistent hashing (`vars`, `header`, `cookie`, `consumer`, `vars_combinations`). | vars | -| balancer.key | No | string | Key for consistent hashing in dynamic load balancing. | | -| provider.auth | Yes | object | Authentication details, including headers and query parameters. | | -| provider.auth.header | No | object | Authentication details sent via headers. Header name must match `^[a-zA-Z0-9._-]+$`. | | -| provider.auth.query | No | object | Authentication details sent via query parameters. Keys must match `^[a-zA-Z0-9._-]+$`. | | -| provider.override.endpoint | No | string | Custom host override for the AI provider. | | -| timeout | No | integer | Request timeout in milliseconds (1-60000). | 30000 | -| keepalive | No | boolean | Enables keepalive connections. | true | -| keepalive_timeout | No | integer | Timeout for keepalive connections (minimum 1000ms). | 60000 | -| keepalive_pool | No | integer | Maximum keepalive connections. | 30 | -| ssl_verify | No | boolean | Enables SSL certificate verification. 
| true | - -## Example usage - -Create a route with the `ai-proxy-multi` plugin like so: +| `messages` | Array | True | An array of message objects. | +| `messages.role` | String | True | Role of the message (`system`, `user`, `assistant`).| +| `messages.content` | String | True | Content of the message. | + +## Attributes + +| Name | Type | Required | Default | Valid Values | Description | +|------------------------------------|----------------|----------|-----------------------------------|--------------|-------------| +| fallback_strategy | string | False | instance_health_and_rate_limiting | instance_health_and_rate_limiting | Fallback strategy. When set, the Plugin will check whether the specified instance’s token has been exhausted when a request is forwarded. If so, forward the request to the next instance regardless of the instance priority. When not set, the Plugin will not forward the request to low priority instances when token of the high priority instance is exhausted. | +| balancer | object | False | | | Load balancing configurations. | +| balancer.algorithm | string | False | roundrobin | [roundrobin, chash] | Load balancing algorithm. When set to `roundrobin`, weighted round robin algorithm is used. When set to `chash`, consistent hashing algorithm is used. | +| balancer.hash_on | string | False | | [vars, headers, cookie, consumer, vars_combinations] | Used when `type` is `chash`. Support hashing on [NGINX variables](https://nginx.org/en/docs/varindex.html), headers, cookie, consumer, or a combination of [NGINX variables](https://nginx.org/en/docs/varindex.html). | +| balancer.key | string | False | | | Used when `type` is `chash`. When `hash_on` is set to `header` or `cookie`, `key` is required. When `hash_on` is set to `consumer`, `key` is not required as the consumer name will be used as the key automatically. | +| instances | array[object] | True | | | LLM instance configurations. | +| instances.name | string | True | | | Name of the LLM service instance. | +| instances.provider | string | True | | [openai, deepseek, openai-compatible] | LLM service provider. When set to `openai`, the Plugin will proxy the request to `api.openai.com`. When set to `deepseek`, the Plugin will proxy the request to `api.deepseek.com`. When set to `openai-compatible`, the Plugin will proxy the request to the custom endpoint configured in `override`. | +| instances.priority | integer | False | 0 | | Priority of the LLM instance in load balancing. `priority` takes precedence over `weight`. | +| instances.weight | string | True | 0 | greater or equal to 0 | Weight of the LLM instance in load balancing. | +| instances.auth | object | True | | | Authentication configurations. | +| instances.auth.header | object | False | | | Authentication headers. At least one of the `header` and `query` should be configured. | +| instances.auth.query | object | False | | | Authentication query parameters. At least one of the `header` and `query` should be configured. | +| instances.options | object | False | | | Model configurations. In addition to `model`, you can configure additional parameters and they will be forwarded to the upstream LLM service in the request body. For instance, if you are working with OpenAI or DeepSeek, you can configure additional parameters such as `max_tokens`, `temperature`, `top_p`, and `stream`. See your LLM provider's API documentation for more available options. | +| instances.options.model | string | False | | | Name of the LLM model, such as `gpt-4` or `gpt-3.5`. 
See your LLM provider's API documentation for more available models. | +| logging | object | False | | | Logging configurations. | +| logging.summaries | boolean | False | false | | If true, log request LLM model, duration, request, and response tokens. | +| logging.payloads | boolean | False | false | | If true, log request and response payload. | +| logging.override | object | False | | | Override setting. | +| logging.override.endpoint | string | False | | | LLM provider endpoint to replace the default endpoint with. If not configured, the Plugin uses the default OpenAI endpoint `https://api.openai.com/v1/chat/completions`. | +| checks | object | False | | | Health check configurations. Note that at the moment, OpenAI and DeepSeek do not provide an official health check endpoint. Other LLM services that you can configure under `openai-compatible` provider may have available health check endpoints. | +| checks.active | object | True | | | Active health check configurations. | +| checks.active.type | string | False | http | [http, https, tcp] | Type of health check connection. | +| checks.active.timeout | number | False | 1 | | Health check timeout in seconds. | +| checks.active.concurrency | integer | False | 10 | | Number of upstream nodes to be checked at the same time. | +| checks.active.host | string | False | | | HTTP host. | +| checks.active.port | integer | False | | between 1 and 65535 inclusive | HTTP port. | +| checks.active.http_path | string | False | / | | Path for HTTP probing requests. | +| checks.active.https_verify_certificate | boolean | False | true | | If true, verify the node's TLS certificate. | +| timeout | integer | False | 30000 | greater than or equal to 1 | Request timeout in milliseconds when requesting the LLM service. | +| keepalive | boolean | False | true | | If true, keep the connection alive when requesting the LLM service. | +| keepalive_timeout | integer | False | 60000 | greater than or equal to 1000 | Request timeout in milliseconds when requesting the LLM service. | +| keepalive_pool | integer | False | 30 | | Keepalive pool size for when connecting with the LLM service. | +| ssl_verify | boolean | False | true | | If true, verify the LLM service's certificate. | + +## Examples + +The examples below demonstrate how you can configure `ai-proxy-multi` for different scenarios. + +:::note + +You can fetch the `admin_key` from `config.yaml` and save to an environment variable with the following command: + +```bash +admin_key=$(yq '.deployment.admin.admin_key[0].key' conf/config.yaml | sed 's/"//g') +``` + +::: + +### Load Balance between Instances + +The following example demonstrates how you can configure two models for load balancing, forwarding 80% of the traffic to one instance and 20% to the other. + +For demonstration and easier differentiation, you will be configuring one OpenAI instance and one DeepSeek instance as the upstream LLM services. 
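The Route configuration below reads the provider keys from environment variables. As a minimal sketch (the key values shown are placeholders, not real credentials), you can export them before creating the Route:

```shell
# Placeholders — substitute your own OpenAI and DeepSeek API keys
export OPENAI_API_KEY=sk-your-openai-key
export DEEPSEEK_API_KEY=sk-your-deepseek-key
```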
+ +Create a Route as such and update with your LLM providers, models, API keys, and endpoints if applicable: ```shell curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ - -H "X-API-KEY: ${ADMIN_API_KEY}" \ + -H "X-API-KEY: ${admin_key}" \ -d '{ "id": "ai-proxy-multi-route", "uri": "/anything", "methods": ["POST"], "plugins": { "ai-proxy-multi": { - "providers": [ + "instances": [ { - "name": "openai", - "model": "gpt-4", - "weight": 1, - "priority": 1, + "name": "openai-instance", + "provider": "openai", + "weight": 8, "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } }, "options": { - "max_tokens": 512, - "temperature": 1.0 + "model": "gpt-4" } }, { - "name": "deepseek", - "model": "deepseek-chat", - "weight": 1, + "name": "deepseek-instance", + "provider": "deepseek", + "weight": 2, "auth": { "header": { "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" } }, "options": { - "max_tokens": 512, - "temperature": 1.0 + "model": "deepseek-chat" } } ] } - }, - "upstream": { - "type": "roundrobin", - "nodes": { - "httpbin.org": 1 - } } }' ``` -In the above configuration, requests will be equally balanced among the `openai` and `deepseek` providers. +Send 10 POST requests to the Route with a system prompt and a sample user question in the request body, to see the number of requests forwarded to OpenAI and DeepSeek: -### Retry and fallback: +```shell +openai_count=0 +deepseek_count=0 + +for i in {1..10}; do + model=$(curl -s "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "What is 1+1?" } + ] + }' | jq -r '.model') + + if [[ "$model" == *"gpt-4"* ]]; then + ((openai_count++)) + elif [[ "$model" == "deepseek-chat" ]]; then + ((deepseek_count++)) + fi +done + +echo "OpenAI responses: $openai_count" +echo "DeepSeek responses: $deepseek_count" +``` + +You should see a response similar to the following: + +```text +OpenAI responses: 8 +DeepSeek responses: 2 +``` + +### Configure Instance Priority and Rate Limiting + +The following example demonstrates how you can configure two models with different priorities and apply rate limiting on the instance with a higher priority. In the case where `fallback_strategy` is set to `instance_health_and_rate_limiting`, the Plugin should continue to forward requests to the low priority instance once the high priority instance's rate limiting quota is fully consumed. -The `priority` attribute can be adjusted to implement the fallback and retry feature. 
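Once the Route below is created, you can tell which instance served a given request by reading the `model` field of the response, as in the previous example. A minimal check (assuming `jq` is installed):

```shell
# Send a sample request and print only the reported model name
curl -s "http://127.0.0.1:9080/anything" -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "What is 1+1?" }
    ]
  }' | jq -r '.model'
```

A `gpt-4`-prefixed value indicates the high priority OpenAI instance, while `deepseek-chat` indicates the lower priority DeepSeek instance.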
+Create a Route as such and update with your LLM providers, models, API keys, and endpoints if applicable: ```shell curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ - -H "X-API-KEY: ${ADMIN_API_KEY}" \ + -H "X-API-KEY: ${admin_key}" \ -d '{ "id": "ai-proxy-multi-route", "uri": "/anything", "methods": ["POST"], "plugins": { "ai-proxy-multi": { - "providers": [ + "fallback_strategy: "instance_health_and_rate_limiting", + "instances": [ { - "name": "openai", - "model": "gpt-4", - "weight": 1, + "name": "openai-instance", + "provider": "openai", "priority": 1, + "weight": 0, "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } }, "options": { - "max_tokens": 512, - "temperature": 1.0 + "model": "gpt-4" } }, { - "name": "deepseek", - "model": "deepseek-chat", - "weight": 1, + "name": "deepseek-instance", + "provider": "deepseek", "priority": 0, + "weight": 0, "auth": { "header": { "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" } }, "options": { - "max_tokens": 512, - "temperature": 1.0 + "model": "deepseek-chat" } } ] + }, + "ai-rate-limiting": { + "instances": [ + { + "name": "openai-instance", + "limit": 10, + "time_window": 60 + } + ], + "limit_strategy": "total_tokens" } + } + }' +``` + +Send a POST request to the Route with a system prompt and a sample user question in the request body: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "What is 1+1?" } + ] + }' +``` + +You should receive a response similar to the following: + +```json +{ + ..., + "model": "gpt-4-0613", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "1+1 equals 2.", + "refusal": null + }, + "logprobs": null, + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 23, + "completion_tokens": 8, + "total_tokens": 31, + "prompt_tokens_details": { + "cached_tokens": 0, + "audio_tokens": 0 }, - "upstream": { - "type": "roundrobin", - "nodes": { - "httpbin.org": 1 + "completion_tokens_details": { + "reasoning_tokens": 0, + "audio_tokens": 0, + "accepted_prediction_tokens": 0, + "rejected_prediction_tokens": 0 + } + }, + "service_tier": "default", + "system_fingerprint": null +} +``` + +Since the `total_tokens` value exceeds the configured quota of `10`, the next request within the 60-second window is expected to be forwarded to the other instance. + +Within the same 60-second window, send another POST request to the route: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "Explain Newton law" } + ] + }' +``` + +You should see a response similar to the following: + +```json +{ + ..., + "model": "deepseek-chat", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "Certainly! Newton's laws of motion are three fundamental principles that describe the relationship between the motion of an object and the forces acting on it. They were formulated by Sir Isaac Newton in the late 17th century and are foundational to classical mechanics.\n\n---\n\n### **1. 
Newton's First Law (Law of Inertia):**\n- **Statement:** An object at rest will remain at rest, and an object in motion will continue moving at a constant velocity (in a straight line at a constant speed), unless acted upon by an external force.\n- **Key Idea:** This law introduces the concept of **inertia**, which is the tendency of an object to resist changes in its state of motion.\n- **Example:** If you slide a book across a table, it eventually stops because of the force of friction acting on it. Without friction, the book would keep moving indefinitely.\n\n---\n\n### **2. Newton's Second Law (Law of Acceleration):**\n- **Statement:** The acceleration of an object is direct ly proportional to the net force acting on it and inversely proportional to its mass. Mathematically, this is expressed as:\n \\[\n F = ma\n \\]\n where:\n - \\( F \\) = net force applied (in Newtons),\n -" + }, + ... + } + ], + ... +} +``` + +### Load Balance and Rate Limit by Consumers + +The following example demonstrates how you can configure two models for load balancing and apply rate limiting by consumer. + +Create a consumer `johndoe` and a rate limiting quota of 10 tokens in a 60-second window on `openai-instance` instance: Review Comment: ```suggestion Create a Consumer `johndoe` and a rate limiting quota of 10 tokens in a 60-second window on `openai-instance` instance: ``` ########## docs/en/latest/plugins/ai-proxy-multi.md: ########## @@ -27,215 +29,977 @@ description: This document contains information about the Apache APISIX ai-proxy # --> -## Description +<head> + <link rel="canonical" href="https://docs.api7.ai/hub/ai-proxy-multi" /> +</head> -The `ai-proxy-multi` plugin simplifies access to LLM providers and models by defining a standard request format -that allows key fields in plugin configuration to be embedded into the request. +## Description -This plugin adds additional features like `load balancing` and `retries` to the existing `ai-proxy` plugin. +The `ai-proxy-multi` Plugin simplifies access to LLM and embedding models by transforming Plugin configurations into the designated request format for OpenAI, DeepSeek, and other OpenAI-compatible APIs. It extends the capabilities of [`ai-proxy-multi`](./ai-proxy.md) with load balancing, retries, fallbacks, and health checks. -Proxying requests to OpenAI is supported now. Other LLM services will be supported soon. +In addition, the Plugin also supports logging LLM request information in the access log, such as token usage, model, time to the first response, and more. ## Request Format -### OpenAI - -- Chat API - | Name | Type | Required | Description | | ------------------ | ------ | -------- | --------------------------------------------------- | -| `messages` | Array | Yes | An array of message objects | -| `messages.role` | String | Yes | Role of the message (`system`, `user`, `assistant`) | -| `messages.content` | String | Yes | Content of the message | - -## Plugin Attributes - -| **Name** | **Required** | **Type** | **Description** | **Default** | -| ---------------------------- | ------------ | -------- | ------------------------------------------------------------------------------------------------------------- | ----------- | -| providers | Yes | array | List of AI providers, each following the provider schema. | | -| provider.name | Yes | string | Name of the AI service provider. Allowed values: `openai`, `deepseek`. | | -| provider.model | Yes | string | Name of the AI model to execute. Example: `gpt-4o`. 
| | -| provider.priority | No | integer | Priority of the provider for load balancing. | 0 | -| provider.weight | No | integer | Load balancing weight. | | -| balancer.algorithm | No | string | Load balancing algorithm. Allowed values: `chash`, `roundrobin`. | roundrobin | -| balancer.hash_on | No | string | Defines what to hash on for consistent hashing (`vars`, `header`, `cookie`, `consumer`, `vars_combinations`). | vars | -| balancer.key | No | string | Key for consistent hashing in dynamic load balancing. | | -| provider.auth | Yes | object | Authentication details, including headers and query parameters. | | -| provider.auth.header | No | object | Authentication details sent via headers. Header name must match `^[a-zA-Z0-9._-]+$`. | | -| provider.auth.query | No | object | Authentication details sent via query parameters. Keys must match `^[a-zA-Z0-9._-]+$`. | | -| provider.override.endpoint | No | string | Custom host override for the AI provider. | | -| timeout | No | integer | Request timeout in milliseconds (1-60000). | 30000 | -| keepalive | No | boolean | Enables keepalive connections. | true | -| keepalive_timeout | No | integer | Timeout for keepalive connections (minimum 1000ms). | 60000 | -| keepalive_pool | No | integer | Maximum keepalive connections. | 30 | -| ssl_verify | No | boolean | Enables SSL certificate verification. | true | - -## Example usage - -Create a route with the `ai-proxy-multi` plugin like so: +| `messages` | Array | True | An array of message objects. | +| `messages.role` | String | True | Role of the message (`system`, `user`, `assistant`).| +| `messages.content` | String | True | Content of the message. | + +## Attributes + +| Name | Type | Required | Default | Valid Values | Description | +|------------------------------------|----------------|----------|-----------------------------------|--------------|-------------| +| fallback_strategy | string | False | instance_health_and_rate_limiting | instance_health_and_rate_limiting | Fallback strategy. When set, the Plugin will check whether the specified instance’s token has been exhausted when a request is forwarded. If so, forward the request to the next instance regardless of the instance priority. When not set, the Plugin will not forward the request to low priority instances when token of the high priority instance is exhausted. | +| balancer | object | False | | | Load balancing configurations. | +| balancer.algorithm | string | False | roundrobin | [roundrobin, chash] | Load balancing algorithm. When set to `roundrobin`, weighted round robin algorithm is used. When set to `chash`, consistent hashing algorithm is used. | +| balancer.hash_on | string | False | | [vars, headers, cookie, consumer, vars_combinations] | Used when `type` is `chash`. Support hashing on [NGINX variables](https://nginx.org/en/docs/varindex.html), headers, cookie, consumer, or a combination of [NGINX variables](https://nginx.org/en/docs/varindex.html). | +| balancer.key | string | False | | | Used when `type` is `chash`. When `hash_on` is set to `header` or `cookie`, `key` is required. When `hash_on` is set to `consumer`, `key` is not required as the consumer name will be used as the key automatically. | +| instances | array[object] | True | | | LLM instance configurations. | +| instances.name | string | True | | | Name of the LLM service instance. | +| instances.provider | string | True | | [openai, deepseek, openai-compatible] | LLM service provider. 
When set to `openai`, the Plugin will proxy the request to `api.openai.com`. When set to `deepseek`, the Plugin will proxy the request to `api.deepseek.com`. When set to `openai-compatible`, the Plugin will proxy the request to the custom endpoint configured in `override`. | +| instances.priority | integer | False | 0 | | Priority of the LLM instance in load balancing. `priority` takes precedence over `weight`. | +| instances.weight | string | True | 0 | greater or equal to 0 | Weight of the LLM instance in load balancing. | +| instances.auth | object | True | | | Authentication configurations. | +| instances.auth.header | object | False | | | Authentication headers. At least one of the `header` and `query` should be configured. | +| instances.auth.query | object | False | | | Authentication query parameters. At least one of the `header` and `query` should be configured. | +| instances.options | object | False | | | Model configurations. In addition to `model`, you can configure additional parameters and they will be forwarded to the upstream LLM service in the request body. For instance, if you are working with OpenAI or DeepSeek, you can configure additional parameters such as `max_tokens`, `temperature`, `top_p`, and `stream`. See your LLM provider's API documentation for more available options. | +| instances.options.model | string | False | | | Name of the LLM model, such as `gpt-4` or `gpt-3.5`. See your LLM provider's API documentation for more available models. | +| logging | object | False | | | Logging configurations. | +| logging.summaries | boolean | False | false | | If true, log request LLM model, duration, request, and response tokens. | +| logging.payloads | boolean | False | false | | If true, log request and response payload. | +| logging.override | object | False | | | Override setting. | +| logging.override.endpoint | string | False | | | LLM provider endpoint to replace the default endpoint with. If not configured, the Plugin uses the default OpenAI endpoint `https://api.openai.com/v1/chat/completions`. | +| checks | object | False | | | Health check configurations. Note that at the moment, OpenAI and DeepSeek do not provide an official health check endpoint. Other LLM services that you can configure under `openai-compatible` provider may have available health check endpoints. | +| checks.active | object | True | | | Active health check configurations. | +| checks.active.type | string | False | http | [http, https, tcp] | Type of health check connection. | +| checks.active.timeout | number | False | 1 | | Health check timeout in seconds. | +| checks.active.concurrency | integer | False | 10 | | Number of upstream nodes to be checked at the same time. | +| checks.active.host | string | False | | | HTTP host. | +| checks.active.port | integer | False | | between 1 and 65535 inclusive | HTTP port. | +| checks.active.http_path | string | False | / | | Path for HTTP probing requests. | +| checks.active.https_verify_certificate | boolean | False | true | | If true, verify the node's TLS certificate. | +| timeout | integer | False | 30000 | greater than or equal to 1 | Request timeout in milliseconds when requesting the LLM service. | +| keepalive | boolean | False | true | | If true, keep the connection alive when requesting the LLM service. | +| keepalive_timeout | integer | False | 60000 | greater than or equal to 1000 | Request timeout in milliseconds when requesting the LLM service. 
| +| keepalive_pool | integer | False | 30 | | Keepalive pool size for when connecting with the LLM service. | +| ssl_verify | boolean | False | true | | If true, verify the LLM service's certificate. | + +## Examples + +The examples below demonstrate how you can configure `ai-proxy-multi` for different scenarios. + +:::note + +You can fetch the `admin_key` from `config.yaml` and save to an environment variable with the following command: + +```bash +admin_key=$(yq '.deployment.admin.admin_key[0].key' conf/config.yaml | sed 's/"//g') +``` + +::: + +### Load Balance between Instances + +The following example demonstrates how you can configure two models for load balancing, forwarding 80% of the traffic to one instance and 20% to the other. + +For demonstration and easier differentiation, you will be configuring one OpenAI instance and one DeepSeek instance as the upstream LLM services. + +Create a Route as such and update with your LLM providers, models, API keys, and endpoints if applicable: ```shell curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ - -H "X-API-KEY: ${ADMIN_API_KEY}" \ + -H "X-API-KEY: ${admin_key}" \ -d '{ "id": "ai-proxy-multi-route", "uri": "/anything", "methods": ["POST"], "plugins": { "ai-proxy-multi": { - "providers": [ + "instances": [ { - "name": "openai", - "model": "gpt-4", - "weight": 1, - "priority": 1, + "name": "openai-instance", + "provider": "openai", + "weight": 8, "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } }, "options": { - "max_tokens": 512, - "temperature": 1.0 + "model": "gpt-4" } }, { - "name": "deepseek", - "model": "deepseek-chat", - "weight": 1, + "name": "deepseek-instance", + "provider": "deepseek", + "weight": 2, "auth": { "header": { "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" } }, "options": { - "max_tokens": 512, - "temperature": 1.0 + "model": "deepseek-chat" } } ] } - }, - "upstream": { - "type": "roundrobin", - "nodes": { - "httpbin.org": 1 - } } }' ``` -In the above configuration, requests will be equally balanced among the `openai` and `deepseek` providers. +Send 10 POST requests to the Route with a system prompt and a sample user question in the request body, to see the number of requests forwarded to OpenAI and DeepSeek: -### Retry and fallback: +```shell +openai_count=0 +deepseek_count=0 + +for i in {1..10}; do + model=$(curl -s "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "What is 1+1?" } + ] + }' | jq -r '.model') + + if [[ "$model" == *"gpt-4"* ]]; then + ((openai_count++)) + elif [[ "$model" == "deepseek-chat" ]]; then + ((deepseek_count++)) + fi +done + +echo "OpenAI responses: $openai_count" +echo "DeepSeek responses: $deepseek_count" +``` + +You should see a response similar to the following: + +```text +OpenAI responses: 8 +DeepSeek responses: 2 +``` + +### Configure Instance Priority and Rate Limiting + +The following example demonstrates how you can configure two models with different priorities and apply rate limiting on the instance with a higher priority. In the case where `fallback_strategy` is set to `instance_health_and_rate_limiting`, the Plugin should continue to forward requests to the low priority instance once the high priority instance's rate limiting quota is fully consumed. -The `priority` attribute can be adjusted to implement the fallback and retry feature. 
+Create a Route as such and update with your LLM providers, models, API keys, and endpoints if applicable: ```shell curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ - -H "X-API-KEY: ${ADMIN_API_KEY}" \ + -H "X-API-KEY: ${admin_key}" \ -d '{ "id": "ai-proxy-multi-route", "uri": "/anything", "methods": ["POST"], "plugins": { "ai-proxy-multi": { - "providers": [ + "fallback_strategy: "instance_health_and_rate_limiting", + "instances": [ { - "name": "openai", - "model": "gpt-4", - "weight": 1, + "name": "openai-instance", + "provider": "openai", "priority": 1, + "weight": 0, "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } }, "options": { - "max_tokens": 512, - "temperature": 1.0 + "model": "gpt-4" } }, { - "name": "deepseek", - "model": "deepseek-chat", - "weight": 1, + "name": "deepseek-instance", + "provider": "deepseek", "priority": 0, + "weight": 0, "auth": { "header": { "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" } }, "options": { - "max_tokens": 512, - "temperature": 1.0 + "model": "deepseek-chat" } } ] + }, + "ai-rate-limiting": { + "instances": [ + { + "name": "openai-instance", + "limit": 10, + "time_window": 60 + } + ], + "limit_strategy": "total_tokens" } + } + }' +``` + +Send a POST request to the Route with a system prompt and a sample user question in the request body: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "What is 1+1?" } + ] + }' +``` + +You should receive a response similar to the following: + +```json +{ + ..., + "model": "gpt-4-0613", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "1+1 equals 2.", + "refusal": null + }, + "logprobs": null, + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 23, + "completion_tokens": 8, + "total_tokens": 31, + "prompt_tokens_details": { + "cached_tokens": 0, + "audio_tokens": 0 }, - "upstream": { - "type": "roundrobin", - "nodes": { - "httpbin.org": 1 + "completion_tokens_details": { + "reasoning_tokens": 0, + "audio_tokens": 0, + "accepted_prediction_tokens": 0, + "rejected_prediction_tokens": 0 + } + }, + "service_tier": "default", + "system_fingerprint": null +} +``` + +Since the `total_tokens` value exceeds the configured quota of `10`, the next request within the 60-second window is expected to be forwarded to the other instance. + +Within the same 60-second window, send another POST request to the route: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "Explain Newton law" } + ] + }' +``` + +You should see a response similar to the following: + +```json +{ + ..., + "model": "deepseek-chat", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "Certainly! Newton's laws of motion are three fundamental principles that describe the relationship between the motion of an object and the forces acting on it. They were formulated by Sir Isaac Newton in the late 17th century and are foundational to classical mechanics.\n\n---\n\n### **1. 
Newton's First Law (Law of Inertia):**\n- **Statement:** An object at rest will remain at rest, and an object in motion will continue moving at a constant velocity (in a straight line at a constant speed), unless acted upon by an external force.\n- **Key Idea:** This law introduces the concept of **inertia**, which is the tendency of an object to resist changes in its state of motion.\n- **Example:** If you slide a book across a table, it eventually stops because of the force of friction acting on it. Without friction, the book would keep moving indefinitely.\n\n---\n\n### **2. Newton's Second Law (Law of Acceleration):**\n- **Statement:** The acceleration of an object is direct ly proportional to the net force acting on it and inversely proportional to its mass. Mathematically, this is expressed as:\n \\[\n F = ma\n \\]\n where:\n - \\( F \\) = net force applied (in Newtons),\n -" + }, + ... + } + ], + ... +} +``` + +### Load Balance and Rate Limit by Consumers + +The following example demonstrates how you can configure two models for load balancing and apply rate limiting by consumer. + +Create a consumer `johndoe` and a rate limiting quota of 10 tokens in a 60-second window on `openai-instance` instance: + +```shell +curl "http://127.0.0.1:9180/apisix/admin/consumers" -X PUT \ + -H "X-API-KEY: ${admin_key}" \ + -d '{ + "username": "johndoe", + "plugins": { + "ai-rate-limiting": { + "instances": [ + { + "name": "openai-instance", + "limit": 10, + "time_window": 60 + } + ], + "rejected_code": 429, + "limit_strategy": "total_tokens" + } + } + }' +``` + +Configure `key-auth` credential for `johndoe`: + +```shell +curl "http://127.0.0.1:9180/apisix/admin/consumers/johndoe/credentials" -X PUT \ + -H "X-API-KEY: ${admin_key}" \ + -d '{ + "id": "cred-john-key-auth", + "plugins": { + "key-auth": { + "key": "john-key" } } }' ``` -In the above configuration `priority` for the deepseek provider is set to `0`. Which means if `openai` provider is unavailable then `ai-proxy-multi` plugin will retry sending request to `deepseek` in the second attempt. +Create another consumer `janedoe` and a rate limiting quota of 10 tokens in a 60-second window on `deepseek-instance` instance: Review Comment: ```suggestion Create another Consumer `janedoe` and a rate limiting quota of 10 tokens in a 60-second window on `deepseek-instance` instance: ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@apisix.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org