featzhang created FLINK-39612:
---------------------------------
Summary: [Model] Add HTTP connection pool management for Triton
inference
Key: FLINK-39612
URL: https://issues.apache.org/jira/browse/FLINK-39612
Project: Flink
Issue Type: Sub-task
Reporter: featzhang
Fix For: 2.2.0
h2. Motivation
The Triton inference integration introduced in FLINK-38857 creates a new HTTP
connection to Triton for every inference request. For workloads that issue
many small predictions over HTTPS, this imposes significant per-request
overhead:
* ~20–30 ms TCP handshake per request.
* ~30–50 ms TLS handshake per HTTPS request.
* Linear growth of server-side sockets, limiting throughput and stability
under bursty traffic.
* No visibility into how connections are being used, which makes capacity
planning and debugging hard.
We need a first-class HTTP connection pool so the Triton client can reuse
connections across requests, share clients across tasks that talk to the
same endpoint with the same configuration, and expose enough configuration
and monitoring for users to tune it to their deployment.
h2. Proposal
Add configurable HTTP connection pooling to the Triton model client, with
reference-counted client sharing and optional pool monitoring.
Implementation outline:
* Extend {{TritonOptions}} with six new {{ConfigOption}}s, all annotated with
{{@Documentation.Section(MODEL_TRITON_ADVANCED)}}:
** {{connection-pool-max-idle}} (Integer, default {{20}})
** {{connection-pool-keep-alive}} (Duration, default {{5 min}})
** {{connection-pool-max-total}} (Integer, default {{100}})
** {{connection-timeout}} (Duration, default {{10 s}}, must be less than the
overall request timeout)
** {{connection-reuse-enabled}} (Boolean, default {{true}})
** {{connection-pool-monitoring-enabled}} (Boolean, default {{false}})
* Introduce a top-level {{ConnectionPoolConfig}} class carrying the above
values, and register the new options in {{TritonModelProviderFactory}}.
* Enhance {{TritonUtils}} with a client cache keyed by
{{(timeout, ConnectionPoolConfig)}}, using reference counting so multiple
subtasks sharing the same configuration share a single OkHttp client, and
the client is closed only when the last reference is released. Keep the
cache guarded by a dedicated {{CACHE_LOCK}}.
* Configure the OkHttp {{ConnectionPool}} and {{Dispatcher}} from
{{ConnectionPoolConfig}} (max idle, keep-alive, max total, per-host limits,
connect timeout). When {{connection-reuse-enabled=false}}, build a
non-pooled client.
* When {{connection-pool-monitoring-enabled=true}}, periodically log pool
statistics (idle / active / queued / total) on a scheduled executor to
support capacity planning.
* Have {{AbstractTritonModelFunction#open()}} read the pool configuration
from options, validate it, and pass the resulting {{ConnectionPoolConfig}}
into the client factory. Release the client reference on {{close()}}.
* Regenerate the auto-generated config docs
({{model_triton_advanced_section.html}}, {{triton_configuration.html}}) so
{{ConfigOptionsDocsCompletenessITCase}} passes, and add a
{{CONNECTION_POOL_README.md}} (with Apache license header) covering the
configuration guide, tuning formulas, monitoring, troubleshooting and
migration notes.
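The new options could be declared roughly as follows. This is a sketch using Flink's standard {{ConfigOptions}} builder; the constant names and descriptions are illustrative, and only three of the six options are shown:
{code:java}
import java.time.Duration;

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

public class TritonPoolOptions {
    public static final ConfigOption<Integer> CONNECTION_POOL_MAX_IDLE =
            ConfigOptions.key("connection-pool-max-idle")
                    .intType()
                    .defaultValue(20)
                    .withDescription("Maximum number of idle connections kept in the pool.");

    public static final ConfigOption<Duration> CONNECTION_POOL_KEEP_ALIVE =
            ConfigOptions.key("connection-pool-keep-alive")
                    .durationType()
                    .defaultValue(Duration.ofMinutes(5))
                    .withDescription("How long an idle connection is kept before eviction.");

    public static final ConfigOption<Boolean> CONNECTION_REUSE_ENABLED =
            ConfigOptions.key("connection-reuse-enabled")
                    .booleanType()
                    .defaultValue(true)
                    .withDescription("Whether to reuse connections across requests.");
}
{code}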
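The reference-counted sharing in {{TritonUtils}} can be sketched as below. The class and field names are assumptions (the sketch uses a plain lock and map standing in for the proposed {{CACHE_LOCK}}-guarded cache), and a dummy closeable client stands in for the OkHttp client:
{code:java}
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class ClientCacheSketch {

    /** Hypothetical value object mirroring the proposed pool options. */
    record ConnectionPoolConfig(
            int maxIdle, Duration keepAlive, int maxTotal,
            Duration connectTimeout, boolean reuseEnabled) {}

    /**
     * Reference-counted cache: callers with an equal key share one client; the
     * client is closed only when the last reference is released. All cache
     * mutations are guarded by a dedicated lock, as in the proposal.
     */
    static final class RefCountedClientCache<K, C extends AutoCloseable> {
        private final Object cacheLock = new Object(); // stands in for CACHE_LOCK
        private final Map<K, Entry<C>> cache = new HashMap<>();
        private final Function<K, C> factory;

        private static final class Entry<C> {
            final C client;
            int refCount;
            Entry(C client) { this.client = client; }
        }

        RefCountedClientCache(Function<K, C> factory) {
            this.factory = factory;
        }

        /** Returns the shared client for this key, creating it on first acquire. */
        C acquire(K key) {
            synchronized (cacheLock) {
                Entry<C> e = cache.computeIfAbsent(key, k -> new Entry<>(factory.apply(k)));
                e.refCount++; // bump the count before any logging happens
                return e.client;
            }
        }

        /** Drops one reference; closes and evicts the client on the last release. */
        void release(K key) throws Exception {
            C toClose = null;
            synchronized (cacheLock) {
                Entry<C> e = cache.get(key);
                if (e != null && --e.refCount == 0) {
                    cache.remove(key);
                    toClose = e.client;
                }
            }
            if (toClose != null) {
                toClose.close(); // close outside the lock to avoid blocking peers
            }
        }
    }

    public static void main(String[] args) throws Exception {
        class FakeClient implements AutoCloseable {
            boolean closed;
            @Override public void close() { closed = true; }
        }
        RefCountedClientCache<ConnectionPoolConfig, FakeClient> cache =
                new RefCountedClientCache<>(key -> new FakeClient());

        ConnectionPoolConfig cfg =
                new ConnectionPoolConfig(20, Duration.ofMinutes(5), 100, Duration.ofSeconds(10), true);
        FakeClient a = cache.acquire(cfg);
        // An equal (but distinct) config object maps to the same shared client.
        FakeClient b = cache.acquire(
                new ConnectionPoolConfig(20, Duration.ofMinutes(5), 100, Duration.ofSeconds(10), true));
        if (a != b) throw new AssertionError("expected shared client");

        cache.release(cfg);
        if (a.closed) throw new AssertionError("client closed while still referenced");
        cache.release(cfg);
        if (!a.closed) throw new AssertionError("client not closed on last release");
    }
}
{code}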
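Wiring those values into OkHttp might look like the following sketch (OkHttp 4 API). Setting the per-host limit equal to the total limit is an assumption, justified here by each client targeting a single Triton endpoint; the method and class names are illustrative:
{code:java}
import java.time.Duration;
import java.util.concurrent.TimeUnit;

import okhttp3.ConnectionPool;
import okhttp3.Dispatcher;
import okhttp3.OkHttpClient;

final class TritonClientFactorySketch {

    static OkHttpClient buildClient(
            int maxIdle, Duration keepAlive, int maxTotal,
            Duration connectTimeout, boolean reuseEnabled, Duration requestTimeout) {
        // The proposal requires connection-timeout < the overall request timeout.
        if (connectTimeout.compareTo(requestTimeout) >= 0) {
            throw new IllegalArgumentException(
                    "connection-timeout must be less than the request timeout");
        }
        OkHttpClient.Builder builder = new OkHttpClient.Builder()
                .connectTimeout(connectTimeout)
                .callTimeout(requestTimeout);
        if (reuseEnabled) {
            // Keep at most maxIdle idle connections, evicted after keepAlive.
            builder.connectionPool(
                    new ConnectionPool(maxIdle, keepAlive.toMillis(), TimeUnit.MILLISECONDS));
            Dispatcher dispatcher = new Dispatcher();
            dispatcher.setMaxRequests(maxTotal);
            dispatcher.setMaxRequestsPerHost(maxTotal); // one Triton endpoint per client
            builder.dispatcher(dispatcher);
        } else {
            // connection-reuse-enabled=false: retain no idle connections at all.
            builder.connectionPool(new ConnectionPool(0, 1, TimeUnit.MILLISECONDS));
        }
        return builder.build();
    }
}
{code}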
h2. Example
{code:sql}
CREATE MODEL sentiment WITH (
  'provider' = 'triton',
  'endpoint' = 'http://triton:8000',
  'model-name' = 'sentiment',
  'connection-pool-max-idle' = '30',
  'connection-pool-max-total' = '150',
  'connection-pool-monitoring-enabled' = 'true'
);
{code}
Expected log output when monitoring is enabled:
{code}
INFO Triton HTTP client created - Pool: maxIdle=30, keepAlive=300000ms,
maxTotal=150, connTimeout=10000ms
INFO Connection Pool Stats - Idle: 15, Active: 10, Queued: 0, Total: 25
{code}
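The monitoring path can be sketched with a daemon scheduled executor. The stats snapshot type and its wiring to OkHttp's pool and dispatcher counters are assumptions; only the log format follows the sample above:
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class PoolMonitorSketch {
    /**
     * Snapshot of pool counters; in the real client these would come from
     * OkHttp's ConnectionPool and Dispatcher.
     */
    record PoolStats(int idle, int active, int queued) {
        int total() { return idle + active; }
    }

    static String formatStats(PoolStats s) {
        return String.format(
                "Connection Pool Stats - Idle: %d, Active: %d, Queued: %d, Total: %d",
                s.idle(), s.active(), s.queued(), s.total());
    }

    /** Logs the stats line on a fixed schedule when monitoring is enabled. */
    static ScheduledExecutorService startMonitor(Supplier<PoolStats> stats, long periodSeconds) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "triton-pool-monitor");
            t.setDaemon(true); // never keep the task manager alive just for monitoring
            return t;
        });
        executor.scheduleAtFixedRate(
                () -> System.out.println("INFO " + formatStats(stats.get())),
                periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return executor;
    }

    public static void main(String[] args) {
        String line = formatStats(new PoolStats(15, 10, 0));
        if (!line.equals("Connection Pool Stats - Idle: 15, Active: 10, Queued: 0, Total: 25")) {
            throw new AssertionError(line);
        }
    }
}
{code}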
h2. Guarantees
* *Connection reuse* — eliminates TCP/TLS handshake per request; measured
30–50% latency reduction and 2–3x throughput improvement on representative
workloads.
* *Client sharing* — subtasks with identical pool configuration share a
single OkHttp client via reference counting, with deterministic cleanup.
* *Thread safety* — all cache mutations are guarded; reference count is
updated before logging to avoid side-effects in log statements.
* *Backward compatibility* — all new options are optional with sensible
defaults; pooling is on by default and can be disabled via
{{connection-reuse-enabled=false}}.
h2. Scope
This change is fully optional and isolated under
{{flink-models/flink-model-triton}}. It does not affect existing Flink
functionality and is gated by new config options that default to safe
values.
h2. Implementation
Pull Request: [apache/flink#27568|https://github.com/apache/flink/pull/27568]
—
{{[FLINK-39612][models] Add HTTP connection pool management for Triton
inference}}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)