nic-6443 commented on code in PR #13606:
URL: https://github.com/apache/apisix/pull/13606#discussion_r3472593856
##########
apisix/plugins/ai-lakera-guard.lua:
##########
@@ -206,4 +215,60 @@ function _M.access(conf, ctx)
end
+function _M.lua_body_filter(conf, ctx, headers, body)
+ if conf.direction ~= "output" and conf.direction ~= "both" then
+ return
+ end
+
+ if ngx.status >= 400 then
+ return
+ end
+
+ -- Non-streaming: ai-proxy hands us the fully-assembled completion text.
+ if ctx.var.request_type == "ai_chat" then
+ local text = ctx.var.llm_response_text
+ if not text or text == "" then
+ return
+ end
+ local messages = { { role = "assistant", content = text } }
+ return moderate(ctx, conf, messages, "response",
+ conf.response_failure_message)
+ end
+
+ -- Streaming: lua_body_filter is invoked once per upstream chunk. We cannot
+ -- scan a partial completion and we must not let flagged tokens reach the
+ -- client, so we buffer every chunk (withholding it with an empty body) and
+ -- scan the assembled completion once at end-of-stream. This trades
+ -- incremental delivery for true blocking.
+ if ctx.var.request_type == "ai_stream" then
Review Comment:
Small optimization for shadow mode: with `action: alert`, this branch still
buffers the entire stream, replaces every chunk with a `:\n\n` heartbeat, and
releases the buffered output all at once at end-of-stream. Shadow mode
therefore pays the full latency/TTFT cost even though it never blocks
anything. `action` is known up front, so alert mode could skip buffering
entirely: let each chunk pass through live (return nothing), then scan
`ctx.var.llm_response_text` and log once at `llm_request_done`. That keeps
shadow mode zero-impact on the stream, which is rather the point of running
it. The withhold-and-buffer path is only needed for `action: block`.
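
A rough sketch of the split I have in mind, runnable on its own with stubs. `stream_filter`, `scan_and_log`, and `buffer_and_withhold` are hypothetical names, not functions in this PR, and I am assuming `conf.action` is the string `"alert"` or `"block"` and that `ctx.var.llm_request_done` is a truthy flag set on the final chunk:

```lua
-- Hypothetical stand-ins so the sketch runs on its own. The real plugin would
-- call Lakera in scan_and_log and keep its existing buffering for block mode.
local scanned = {}
local function scan_and_log(ctx, conf, text)
    scanned[#scanned + 1] = text
end

local function buffer_and_withhold(ctx, conf, body)
    ctx.buffered = (ctx.buffered or "") .. body
    return ":\n\n" -- keep-alive sent in place of the withheld chunk
end

-- Streaming branch only. Returning nothing leaves the chunk untouched.
local function stream_filter(conf, ctx, body)
    if conf.action == "alert" then
        -- Shadow mode: pass every chunk through; scan once when the
        -- stream is done (assumed flag: ctx.var.llm_request_done).
        if ctx.var.llm_request_done then
            local text = ctx.var.llm_response_text
            if text and text ~= "" then
                scan_and_log(ctx, conf, text)
            end
        end
        return
    end
    -- action == "block": withhold and buffer as the PR does today.
    return buffer_and_withhold(ctx, conf, body)
end
```

With this shape the alert path never returns a replacement body, so the client sees the original chunks with no added delay, and the scan happens once per stream.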
##########
apisix/plugins/ai-lakera-guard.lua:
##########
@@ -206,4 +215,60 @@ function _M.access(conf, ctx)
end
+function _M.lua_body_filter(conf, ctx, headers, body)
+ if conf.direction ~= "output" and conf.direction ~= "both" then
+ return
+ end
+
+ if ngx.status >= 400 then
+ return
+ end
+
+ -- Non-streaming: ai-proxy hands us the fully-assembled completion text.
+ if ctx.var.request_type == "ai_chat" then
+ local text = ctx.var.llm_response_text
+ if not text or text == "" then
+ return
+ end
+ local messages = { { role = "assistant", content = text } }
+ return moderate(ctx, conf, messages, "response",
+ conf.response_failure_message)
+ end
+
+ -- Streaming: lua_body_filter is invoked once per upstream chunk. We cannot
+ -- scan a partial completion and we must not let flagged tokens reach the
+ -- client, so we buffer every chunk (withholding it with an empty body) and
Review Comment:
Two minor things while you're here:
- This comment says the chunk is withheld "with an empty body", but it's
actually replaced with a `:\n\n` SSE keep-alive (line 251). Worth fixing the
wording to match — and maybe noting *why* a keep-alive rather than `""`: an
empty string trips nginx's "nothing to flush", and returning `nil` would let
the original chunk leak to the client.
- `local messages = { { role = "assistant", content = text } }` followed by
the `moderate(..., "response", conf.response_failure_message)` call is
duplicated verbatim between this streaming branch and the non-streaming
`ai_chat` branch above. A small `moderate_response(ctx, conf, text)` local
would collapse both into one.
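
Something along these lines, as a sketch only. The `moderate` stub below just mirrors the argument order of the call in the diff so the snippet runs standalone, and folding the empty-text guard into the helper is my assumption (the streaming branch may already handle that differently):

```lua
-- Stub with the same argument order as the moderate(...) call in the diff;
-- it only records what it was called with.
local calls = {}
local function moderate(ctx, conf, messages, kind, failure_message)
    calls[#calls + 1] = {
        messages = messages,
        kind = kind,
        failure_message = failure_message,
    }
end

-- Suggested helper: wraps the completion text as a single assistant message
-- and runs response-side moderation. Skips empty or missing text.
local function moderate_response(ctx, conf, text)
    if not text or text == "" then
        return
    end
    local messages = { { role = "assistant", content = text } }
    return moderate(ctx, conf, messages, "response",
                    conf.response_failure_message)
end
```

Both the `ai_chat` branch and the end-of-stream path could then reduce to `return moderate_response(ctx, conf, text)`.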