Re: [PR] docs: update `ai-rate-limiting` and `ai-rag` docs [apisix]

via GitHub Sat, 05 Apr 2025 10:21:22 -0700


Yilialinn commented on code in PR #12107:
URL: https://github.com/apache/apisix/pull/12107#discussion_r2022197774



##########
docs/en/latest/plugins/ai-rate-limiting.md:
##########
@@ -116,4 +230,644 @@ You should receive a response similar to the following:
 }
 ```
 
-If rate limiting quota of 300 tokens has been consumed in a 30-second window, the additional requests will all be rejected.
+If `deepseek-instance-1` instance rate limiting quota of 100 tokens has been consumed in a 30-second window, the additional requests will all be forwarded to `deepseek-instance-2`, which is not rate limited.
+
+### Apply the Same Quota to All Instances
+
+The following example demonstrates how you can apply the same rate limiting quota to all LLM upstream instances in `ai-rate-limiting`.
+
+For demonstration and easier differentiation, you will be configuring one OpenAI instance and one DeepSeek instance as the upstream LLM services.
+
+Create a Route which applies a rate limiting quota of 100 total tokens for all instances within a 60-second window, and update with your LLM providers, models, API keys, and endpoints, if applicable:
+
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
+  -H "X-API-KEY: ${admin_key}" \
+  -d '{
+    "id": "ai-rate-limiting-route",
+    "uri": "/anything",
+    "methods": ["POST"],
+    "plugins": {
+      "ai-proxy-multi": {
+        "instances": [
+          {
+            "name": "openai-instance",
+            "provider": "openai",
+            "weight": 0,
+            "auth": {
+              "header": {
+                "Authorization": "Bearer '"$OPENAI_API_KEY"'"
+              }
+            },
+            "options": {
+              "model": "gpt-4"
+            }
+          },
+          {
+            "name": "deepseek-instance",
+            "provider": "deepseek",
+            "weight": 0,
+            "auth": {
+              "header": {
+                "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
+              }
+            },
+            "options": {
+              "model": "deepseek-chat"
+            }
+          }
+        ]
+      },
+      "ai-rate-limiting": {
+        "limit": 100,
+        "time_window": 60,
+        "rejected_code": 429,
+        "limit_strategy": "total_tokens"
+      }
+    }
+  }'
+```
+
+Send a POST request to the Route with a system prompt and a sample user question in the request body:
+
+```shell
+curl -i "http://127.0.0.1:9080/anything" -X POST \
+  -H "Content-Type: application/json" \
+  -d '{
+    "messages": [
+      { "role": "system", "content": "You are a mathematician" },
+      { "role": "user", "content": "Explain Newtons laws" }
+    ]
+  }'
+```
+
+You should receive a response from either LLM instance, similar to the following:
+
+```json
+{
+  ...,
+  "model": "gpt-4-0613",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "Sure! Sir Isaac Newton formulated three laws of motion that describe the motion of objects. These laws are widely used in physics and engineering for studying and understanding how things move. Here they are:\n\n1. Newton's First Law - Law of Inertia: An object at rest tends to stay at rest and an object in motion tends to stay in motion with the same speed and in the same direction unless acted upon by an unbalanced force. This is also known as the principle of inertia.\n\n2. Newton's Second Law of Motion - Force and Acceleration: The acceleration of an object is directly proportional to the net force acting on it and inversely proportional to its mass. This is usually formulated as F=ma where F is the force applied, m is the mass of the object and a is the acceleration produced.\n\n3. Newton's Third Law - Action and Reaction: For every action, there is an equal and opposite reaction. This means that any force exerted on a body will create a force of equal magnitude but in the opposite direction on the object that exerted the first force.\n\nIn simple terms: \n1. If you slide a book on a table and let go, it will stop because of the friction (or force) between it and the table.\n2.",
+        "refusal": null
+      },
+      "logprobs": null,
+      "finish_reason": "length"
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 23,
+    "completion_tokens": 256,
+    "total_tokens": 279,
+    "prompt_tokens_details": {
+      "cached_tokens": 0,
+      "audio_tokens": 0
+    },
+    "completion_tokens_details": {
+      "reasoning_tokens": 0,
+      "audio_tokens": 0,
+      "accepted_prediction_tokens": 0,
+      "rejected_prediction_tokens": 0
+    }
+  },
+  "service_tier": "default",
+  "system_fingerprint": null
+}
+```
+
+Since the `total_tokens` value exceeds the configured quota of `100`, the next request within the 60-second window is expected to be forwarded to the other instance.
+
+Within the same 60-second window, send another POST request to the Route:
+
+```shell
+curl -i "http://127.0.0.1:9080/anything" -X POST \
+  -H "Content-Type: application/json" \
+  -d '{
+    "messages": [
+      { "role": "system", "content": "You are a mathematician" },
+      { "role": "user", "content": "Explain Newtons laws" }
+    ]
+  }'
+```
+
+You should receive a response from the other LLM instance, similar to the following:
+
+```json
+{
+  ...
+  "model": "deepseek-chat",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "Sure! Newton's laws of motion are three fundamental principles that describe the relationship between the motion of an object and the forces acting on it. They were formulated by Sir Isaac Newton in the late 17th century and are foundational to classical mechanics. Here's an explanation of each law:\n\n---\n\n### **1. Newton's First Law (Law of Inertia)**\n- **Statement**: An object will remain at rest or in uniform motion in a straight line unless acted upon by an external force.\n- **What it means**: This law introduces the concept of **inertia**, which is the tendency of an object to resist changes in its state of motion. If no net force acts on an object, its velocity (speed and direction) will not change.\n- **Example**: A book lying on a table will stay at rest unless you push it. Similarly, a hockey puck sliding on ice will keep moving at a constant speed unless friction or another force slows it down.\n\n---\n\n### **2. Newton's Second Law (Law of Acceleration)**\n- **Statement**: The acceleration of an object is directly proportional to the net force acting on it and inversely proportional to its mass. Mathematically, this is expressed as:\n  \\[\n  F = ma\n  \\]\n"
+      },
+      "logprobs": null,
+      "finish_reason": "length"
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 13,
+    "completion_tokens": 256,
+    "total_tokens": 269,
+    "prompt_tokens_details": {
+      "cached_tokens": 0
+    },
+    "prompt_cache_hit_tokens": 0,
+    "prompt_cache_miss_tokens": 13
+  },
+  "system_fingerprint": "fp_3a5770e1b4_prod0225"
+}
+```
+
+Since the `total_tokens` value exceeds the configured quota of `100`, the next request within the 60-second window is expected to be rejected.
+
+Within the same 60-second window, send a third POST request to the Route:
+
+```shell
+curl -i "http://127.0.0.1:9080/anything" -X POST \
+  -H "Content-Type: application/json" \
+  -d '{
+    "messages": [
+      { "role": "system", "content": "You are a mathematician" },
+      { "role": "user", "content": "Explain Newtons laws" }
+    ]
+  }'
+```
+
+You should receive an `HTTP 429 Too Many Requests` response and observe the following headers:
+
+```text
+X-AI-RateLimit-Limit-openai-instance: 100
+X-AI-RateLimit-Remaining-openai-instance: 0
+X-AI-RateLimit-Reset-openai-instance: 0
+X-AI-RateLimit-Limit-deepseek-instance: 100
+X-AI-RateLimit-Remaining-deepseek-instance: 0
+X-AI-RateLimit-Reset-deepseek-instance: 0
+```
+
+### Configure Instance Priority and Rate Limiting
+
+The following example demonstrates how you can configure two models with different priorities and apply rate limiting on the instance with a higher priority. In the case where `fallback_strategy` is set to `instance_health_and_rate_limiting`, the Plugin should continue to forward requests to the low priority instance once the high priority instance's rate limiting quota is fully consumed.
+
+Create a Route as such to set rate limiting and a higher priority on `openai-instance` instance and set the `fallback_strategy` to `instance_health_and_rate_limiting`. Update with your LLM providers, models, API keys, and endpoints, if applicable:
+
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
+  -H "X-API-KEY: ${admin_key}" \
+  -d '{
+    "id": "ai-rate-limiting-route",
+    "uri": "/anything",
+    "methods": ["POST"],
+    "plugins": {
+      "ai-proxy-multi": {
+        "fallback_strategy": "instance_health_and_rate_limiting",
+        "instances": [
+          {
+            "name": "openai-instance",
+            "provider": "openai",
+            "priority": 1,
+            "weight": 0,
+            "auth": {
+              "header": {
+                "Authorization": "Bearer '"$OPENAI_API_KEY"'"
+              }
+            },
+            "options": {
+              "model": "gpt-4"
+            }
+          },
+          {
+            "name": "deepseek-instance",
+            "provider": "deepseek",
+            "priority": 0,
+            "weight": 0,
+            "auth": {
+              "header": {
+                "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
+              }
+            },
+            "options": {
+              "model": "deepseek-chat"
+            }
+          }
+        ]
+      },
+      "ai-rate-limiting": {
+        "instances": [
+          {
+            "name": "openai-instance",
+            "limit": 10,
+            "time_window": 60
+          }
+        ],
+        "limit_strategy": "total_tokens"
+      }
+    }
+  }'
+```
+
+Send a POST request to the Route with a system prompt and a sample user question in the request body:
+
+```shell
+curl "http://127.0.0.1:9080/anything" -X POST \
+  -H "Content-Type: application/json" \
+  -d '{
+    "messages": [
+      { "role": "system", "content": "You are a mathematician" },
+      { "role": "user", "content": "What is 1+1?" }
+    ]
+  }'
+```
+
+You should receive a response similar to the following:
+
+```json
+{
+  ...,
+  "model": "gpt-4-0613",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "1+1 equals 2.",
+        "refusal": null
+      },
+      "logprobs": null,
+      "finish_reason": "stop"
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 23,
+    "completion_tokens": 8,
+    "total_tokens": 31,
+    "prompt_tokens_details": {
+      "cached_tokens": 0,
+      "audio_tokens": 0
+    },
+    "completion_tokens_details": {
+      "reasoning_tokens": 0,
+      "audio_tokens": 0,
+      "accepted_prediction_tokens": 0,
+      "rejected_prediction_tokens": 0
+    }
+  },
+  "service_tier": "default",
+  "system_fingerprint": null
+}
+```
+
+Since the `total_tokens` value exceeds the configured quota of `10`, the next request within the 60-second window is expected to be forwarded to the other instance.
+
+Within the same 60-second window, send another POST request to the Route:
+
+```shell
+curl "http://127.0.0.1:9080/anything" -X POST \
+  -H "Content-Type: application/json" \
+  -d '{
+    "messages": [
+      { "role": "system", "content": "You are a mathematician" },
+      { "role": "user", "content": "Explain Newton law" }
+    ]
+  }'
+```
+
+You should see a response similar to the following:
+
+```json
+{
+  ...,
+  "model": "deepseek-chat",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "Certainly! Newton's laws of motion are three fundamental principles that describe the relationship between the motion of an object and the forces acting on it. They were formulated by Sir Isaac Newton in the late 17th century and are foundational to classical mechanics.\n\n---\n\n### **1. Newton's First Law (Law of Inertia):**\n- **Statement:** An object at rest will remain at rest, and an object in motion will continue moving at a constant velocity (in a straight line at a constant speed), unless acted upon by an external force.\n- **Key Idea:** This law introduces the concept of **inertia**, which is the tendency of an object to resist changes in its state of motion.\n- **Example:** If you slide a book across a table, it eventually stops because of the force of friction acting on it. Without friction, the book would keep moving indefinitely.\n\n---\n\n### **2. Newton's Second Law (Law of Acceleration):**\n- **Statement:** The acceleration of an object is directly proportional to the net force acting on it and inversely proportional to its mass. Mathematically, this is expressed as:\n  \\[\n  F = ma\n  \\]\n  where:\n  - \\( F \\) = net force applied (in Newtons),\n  -"
+      },
+      ...
+    }
+  ],
+  ...
+}
+```
+
+### Load Balance and Rate Limit by Consumers
+
+The following example demonstrates how you can configure two models for load balancing and apply rate limiting by consumer.
+
+Create a consumer `johndoe` and a rate limiting quota of 10 tokens in a 60-second window on `openai-instance` instance:
+
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/consumers" -X PUT \
+  -H "X-API-KEY: ${admin_key}" \
+  -d '{
+    "username": "johndoe",
+    "plugins": {
+      "ai-rate-limiting": {
+        "instances": [
+          {
+            "name": "openai-instance",
+            "limit": 10,
+            "time_window": 60
+          }
+        ],
+        "rejected_code": 429,
+        "limit_strategy": "total_tokens"
+      }
+    }
+  }'
+```
+
+Configure `key-auth` credential for `johndoe`:
+
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/consumers/johndoe/credentials" -X PUT \
+  -H "X-API-KEY: ${admin_key}" \
+  -d '{
+    "id": "cred-john-key-auth",
+    "plugins": {
+      "key-auth": {
+        "key": "john-key"
+      }
+    }
+  }'
+```
+
+Create another consumer `janedoe` and a rate limiting quota of 10 tokens in a 60-second window on `deepseek-instance` instance:

Review Comment:
   check accordingly



##########
docs/en/latest/plugins/ai-rate-limiting.md:
##########
@@ -116,4 +230,644 @@ You should receive a response similar to the following:
 }
 ```
 
+Create another consumer `janedoe` and a rate limiting quota of 10 tokens in a 60-second window on `deepseek-instance` instance:

Review Comment:
   ```suggestion
   Create another Consumer `janedoe` and a rate limiting quota of 10 tokens in a 60-second window on `deepseek-instance` instance:
   ```
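
For anyone checking the "Apply the Same Quota to All Instances" walk-through in the diff above, the expected sequence (first instance, then the other instance, then `429`) can be reproduced with a small sketch. This is only a model inferred from the documented responses, not the plugin's implementation: the `simulate` helper, the first-available selection order, and the clamping of the remaining quota to zero are all assumptions made for illustration. The token counts 279 and 269 are taken from the sample responses.

```python
def simulate(limit, usages, instances):
    """Return the instance that serves each request, or 429 when no quota is left.

    Hypothetical model: a request is admitted while the chosen instance has
    remaining quota > 0, and its token usage is deducted after the response.
    """
    remaining = {name: limit for name in instances}
    outcomes = []
    for used in usages:
        # Assumed selection order: first instance that still has quota.
        target = next((n for n in instances if remaining[n] > 0), None)
        if target is None:
            outcomes.append(429)  # every instance has exhausted its quota
            continue
        # Usage is only known after the response, so an overshoot is clamped to 0.
        remaining[target] = max(0, remaining[target] - used)
        outcomes.append(target)
    return outcomes


# The third value is a placeholder; it is never deducted because the request is rejected.
print(simulate(100, [279, 269, 50], ["openai-instance", "deepseek-instance"]))
```

Under these assumptions the first request overshoots the 100-token quota but is still served, which matches the doc's note that only the next request in the window is redirected or rejected.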



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

