nic-6443 commented on code in PR #13487:
URL: https://github.com/apache/apisix/pull/13487#discussion_r3378681434
##########
apisix/plugins/prometheus/exporter.lua:
##########
@@ -366,16 +416,36 @@ function _M.http_log(conf, ctx)
gen_arr(route_id, service_id, consumer_name, balancer_ip,
vars.request_type, vars.request_llm_model, vars.llm_model,
unpack(extra_labels("llm_latency", ctx))))
+
+ -- Only streaming requests expose a real TTFT; for non-streaming the
+ -- var holds the total response time, which would pollute the TTFT
+ -- distribution, so record llm_ttft for ai_stream only.
+ if vars.request_type == "ai_stream" then
+ metrics.llm_ttft:observe(tonumber(llm_time_to_first_token),
+ gen_arr(route_id, service_id, consumer_name, balancer_ip,
+ vars.request_type, vars.request_llm_model, vars.llm_model,
+ unpack(extra_labels("llm_ttft", ctx))))
+ end
Review Comment:
> `apisix_llm_ttft` — LLM time to first token (milliseconds), observed for
> streaming (ai_stream) requests only. The existing apisix_llm_latency mixes
> streaming TTFT and non-streaming total latency in one series; this dedicated
> metric keeps the TTFT distribution semantically consistent so it can be used
> for streaming latency SLOs.

Then we should also adjust the existing `apisix_llm_latency`: for streaming
requests, it should record the time at which the entire response completes.
Otherwise `apisix_llm_ttft` and `apisix_llm_latency` will measure the same
thing for streams and overlap in function.
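
A rough sketch of the split I have in mind, in plain Lua so it can be read on its own. The helper name `pick_llm_observations`, the `ttft_ms` / `total_ms` parameters, and the `"ai_chat"` request type used in the usage lines are illustrative assumptions, not existing exporter code; where the total streaming time would actually come from is left open.

```lua
-- Hypothetical helper: decides which LLM histograms get a sample.
-- Not part of apisix/plugins/prometheus/exporter.lua.
local function pick_llm_observations(request_type, ttft_ms, total_ms)
    local obs = {}

    -- llm_latency always means "time until the whole response is done",
    -- for both streaming and non-streaming requests.
    if total_ms then
        obs.llm_latency = total_ms
    end

    -- llm_ttft is only meaningful when the response is streamed.
    if request_type == "ai_stream" and ttft_ms then
        obs.llm_ttft = ttft_ms
    end

    return obs
end

-- Usage: a streamed request yields both samples, a non-streamed one
-- yields only the total latency.
local stream = pick_llm_observations("ai_stream", 120, 2400)
local chat = pick_llm_observations("ai_chat", nil, 900)
```

With this shape the two series never carry the same value for a streamed request, so dashboards can use `apisix_llm_ttft` for first-token SLOs and `apisix_llm_latency` for end-to-end latency.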