featzhang created FLINK-39612:
---------------------------------

             Summary: [Model] Add HTTP connection pool management for Triton inference
                 Key: FLINK-39612
                 URL: https://issues.apache.org/jira/browse/FLINK-39612
             Project: Flink
          Issue Type: Sub-task
            Reporter: featzhang
             Fix For: 2.2.0




  h2. Motivation

  The Triton inference integration introduced in FLINK-38857 creates a new HTTP
  connection to Triton for every inference request. For workloads that issue
  many small predictions over HTTPS this imposes significant per-request
  overhead:

  * ~20–30 ms TCP handshake per request.
  * ~30–50 ms TLS handshake per HTTPS request.
  * Server-side socket count grows linearly with the request rate, limiting
    throughput and stability under bursty traffic.
  * No visibility into how connections are being used, which makes capacity
    planning and debugging hard.

  We need a first-class HTTP connection pool so the Triton client can reuse
  connections across requests, share clients across tasks that talk to the
  same endpoint with the same configuration, and expose enough configuration
  and monitoring for users to tune it to their deployment.

  h2. Proposal

  Add configurable HTTP connection pooling to the Triton model client, with
  reference-counted client sharing and optional pool monitoring.

  Implementation outline:

  * Extend {{TritonOptions}} with six new {{ConfigOption}}s, all annotated with
    {{@Documentation.Section(MODEL_TRITON_ADVANCED)}}:
  ** {{connection-pool-max-idle}} (Integer, default {{20}})
  ** {{connection-pool-keep-alive}} (Duration, default {{5 min}})
  ** {{connection-pool-max-total}} (Integer, default {{100}})
  ** {{connection-timeout}} (Duration, default {{10 s}}, must be less than the
     overall request timeout)
  ** {{connection-reuse-enabled}} (Boolean, default {{true}})
  ** {{connection-pool-monitoring-enabled}} (Boolean, default {{false}})
  * Introduce a top-level {{ConnectionPoolConfig}} class carrying the above
    values, and register the new options in {{TritonModelProviderFactory}}.
  * Enhance {{TritonUtils}} with a client cache keyed by
    {{(timeout, ConnectionPoolConfig)}}, using reference counting so multiple
    subtasks sharing the same configuration share a single OkHttp client, and
    the client is closed only when the last reference is released. Keep the
    cache guarded by a dedicated {{CACHE_LOCK}}.
  * Configure the OkHttp {{ConnectionPool}} and {{Dispatcher}} from
    {{ConnectionPoolConfig}} (max idle, keep-alive, max total, per-host limits,
    connect timeout). When {{connection-reuse-enabled=false}}, build a
    non-pooled client.
  * When {{connection-pool-monitoring-enabled=true}}, periodically log pool
    statistics (idle / active / queued / total) on a scheduled executor to
    support capacity planning.
  * Have {{AbstractTritonModelFunction#open()}} read the pool configuration
    from options, validate it, and pass the resulting {{ConnectionPoolConfig}}
    into the client factory. Release the client reference on {{close()}}.
  * Regenerate the auto-generated config docs
    ({{model_triton_advanced_section.html}}, {{triton_configuration.html}}) so
    {{ConfigOptionsDocsCompletenessITCase}} passes, and add a
    {{CONNECTION_POOL_README.md}} (with Apache license header) covering the
    configuration guide, tuning formulas, monitoring, troubleshooting and
    migration notes.
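
  As a rough illustration of the configuration class and the validation step
  in the outline above, the following is a minimal sketch, not the code from
  the pull request. The field names, the {{defaults()}} and {{validate()}}
  helpers, and every check other than the connect-timeout rule are assumptions
  made for illustration.

  {code:java}
  import java.time.Duration;
  import java.util.Objects;

  /** Hypothetical value class; fields mirror the six proposed options. */
  final class ConnectionPoolConfig {
      final int maxIdle;
      final Duration keepAlive;
      final int maxTotal;
      final Duration connectionTimeout;
      final boolean reuseEnabled;
      final boolean monitoringEnabled;

      ConnectionPoolConfig(int maxIdle, Duration keepAlive, int maxTotal,
              Duration connectionTimeout, boolean reuseEnabled, boolean monitoringEnabled) {
          this.maxIdle = maxIdle;
          this.keepAlive = Objects.requireNonNull(keepAlive);
          this.maxTotal = maxTotal;
          this.connectionTimeout = Objects.requireNonNull(connectionTimeout);
          this.reuseEnabled = reuseEnabled;
          this.monitoringEnabled = monitoringEnabled;
      }

      /** Defaults as listed in the proposal. */
      static ConnectionPoolConfig defaults() {
          return new ConnectionPoolConfig(
                  20, Duration.ofMinutes(5), 100, Duration.ofSeconds(10), true, false);
      }

      /** Rejects inconsistent settings before any client is built. */
      void validate(Duration requestTimeout) {
          // Stated in the proposal: connect timeout must be below the request timeout.
          if (connectionTimeout.compareTo(requestTimeout) >= 0) {
              throw new IllegalArgumentException(
                      "connection-timeout must be less than the request timeout");
          }
          // Assumed sanity checks, not taken from the proposal.
          if (maxIdle < 0 || maxTotal <= 0 || maxIdle > maxTotal) {
              throw new IllegalArgumentException("invalid pool sizes");
          }
          if (keepAlive.isZero() || keepAlive.isNegative()) {
              throw new IllegalArgumentException("keep-alive must be positive");
          }
      }

      // equals/hashCode let the config take part in a cache key.
      @Override
      public boolean equals(Object o) {
          if (this == o) {
              return true;
          }
          if (!(o instanceof ConnectionPoolConfig)) {
              return false;
          }
          ConnectionPoolConfig c = (ConnectionPoolConfig) o;
          return maxIdle == c.maxIdle
                  && maxTotal == c.maxTotal
                  && reuseEnabled == c.reuseEnabled
                  && monitoringEnabled == c.monitoringEnabled
                  && keepAlive.equals(c.keepAlive)
                  && connectionTimeout.equals(c.connectionTimeout);
      }

      @Override
      public int hashCode() {
          return Objects.hash(maxIdle, keepAlive, maxTotal,
                  connectionTimeout, reuseEnabled, monitoringEnabled);
      }
  }
  {code}

  Value-based {{equals}}/{{hashCode}} matter here because two subtasks can
  only share a client if their configurations compare equal.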

  h2. Example

  {code:sql}
  CREATE MODEL sentiment WITH (
    'provider' = 'triton',
    'endpoint' = 'http://triton:8000',
    'model-name' = 'sentiment',
    'connection-pool-max-idle' = '30',
    'connection-pool-max-total' = '150',
    'connection-pool-monitoring-enabled' = 'true'
  );
  {code}

  Expected log output when monitoring is enabled:

  {code}
  INFO  Triton HTTP client created - Pool: maxIdle=30, keepAlive=300000ms, maxTotal=150, connTimeout=10000ms
  INFO  Connection Pool Stats - Idle: 15, Active: 10, Queued: 0, Total: 25
  {code}
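
  A periodic reporter producing the statistics line above might look roughly
  like the sketch below. {{PoolStatsReporter}} and its suppliers are
  hypothetical names; with OkHttp the suppliers could plausibly be backed by
  {{ConnectionPool.connectionCount()}}, {{ConnectionPool.idleConnectionCount()}}
  and {{Dispatcher.queuedCallsCount()}}, and deriving "Active" as total minus
  idle is an assumption about how the numbers relate.

  {code:java}
  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.TimeUnit;
  import java.util.function.Consumer;
  import java.util.function.IntSupplier;

  /** Hypothetical reporter that emits one pool-statistics line per period. */
  final class PoolStatsReporter implements AutoCloseable {
      private final ScheduledExecutorService scheduler =
              Executors.newSingleThreadScheduledExecutor(r -> {
                  Thread t = new Thread(r, "triton-pool-stats");
                  t.setDaemon(true); // never keep the JVM alive just for stats
                  return t;
              });
      private final IntSupplier total;
      private final IntSupplier idle;
      private final IntSupplier queued;
      private final Consumer<String> sink;

      PoolStatsReporter(IntSupplier total, IntSupplier idle, IntSupplier queued,
              Consumer<String> sink) {
          this.total = total;
          this.idle = idle;
          this.queued = queued;
          this.sink = sink;
      }

      /** Formats one statistics line; active is derived as total minus idle. */
      static String format(int idle, int total, int queued) {
          int active = Math.max(0, total - idle);
          return String.format(
                  "Connection Pool Stats - Idle: %d, Active: %d, Queued: %d, Total: %d",
                  idle, active, queued, total);
      }

      /** Starts reporting at a fixed rate. */
      void start(long period, TimeUnit unit) {
          scheduler.scheduleAtFixedRate(
                  () -> sink.accept(
                          format(idle.getAsInt(), total.getAsInt(), queued.getAsInt())),
                  period, period, unit);
      }

      @Override
      public void close() {
          scheduler.shutdownNow();
      }
  }
  {code}

  The counters are read one after another, so a line may mix values from
  slightly different moments. For capacity planning that is usually acceptable.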

  h2. Guarantees

  * *Connection reuse* — eliminates TCP/TLS handshake per request; measured
    30–50% latency reduction and 2–3x throughput improvement on representative
    workloads.
  * *Client sharing* — subtasks with identical pool configuration share a
    single OkHttp client via reference counting, with deterministic cleanup.
  * *Thread safety* — all cache mutations are guarded; reference count is
    updated before logging to avoid side-effects in log statements.
  * *Backward compatibility* — all new options are optional with sensible
    defaults; pooling is on by default and can be disabled via
    {{connection-reuse-enabled=false}}.
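
  The client-sharing and thread-safety guarantees could be realized along the
  lines of the sketch below. It is generic over the key and client types so it
  runs without OkHttp; {{RefCountedCache}}, {{acquire}} and {{release}} are
  hypothetical names, not identifiers confirmed by the pull request.

  {code:java}
  import java.util.HashMap;
  import java.util.Map;
  import java.util.function.Consumer;
  import java.util.function.Function;

  /** Hypothetical reference-counted cache guarded by a single lock. */
  final class RefCountedCache<K, C> {
      private static final class Entry<C> {
          final C client;
          int refCount;

          Entry(C client) {
              this.client = client;
          }
      }

      private final Object cacheLock = new Object();
      private final Map<K, Entry<C>> entries = new HashMap<>();
      private final Function<K, C> factory;
      private final Consumer<C> closer;

      RefCountedCache(Function<K, C> factory, Consumer<C> closer) {
          this.factory = factory;
          this.closer = closer;
      }

      /** Returns the shared client for the key, creating it on first use. */
      C acquire(K key) {
          synchronized (cacheLock) {
              Entry<C> entry = entries.get(key);
              if (entry == null) {
                  entry = new Entry<>(factory.apply(key));
                  entries.put(key, entry);
              }
              entry.refCount++;
              return entry.client;
          }
      }

      /** Drops one reference; the client is closed when the last one is released. */
      void release(K key) {
          C toClose = null;
          synchronized (cacheLock) {
              Entry<C> entry = entries.get(key);
              if (entry == null) {
                  return; // unknown key or already fully released
              }
              // Update the count first, then use the local value for any logging.
              int remaining = --entry.refCount;
              if (remaining == 0) {
                  entries.remove(key);
                  toClose = entry.client;
              }
          }
          if (toClose != null) {
              closer.accept(toClose); // close outside the lock
          }
      }

      int size() {
          synchronized (cacheLock) {
              return entries.size();
          }
      }
  }
  {code}

  Closing the client outside the lock keeps a slow shutdown from blocking
  other subtasks that are acquiring or releasing clients at the same time.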

  h2. Scope

  This change is isolated under {{flink-models/flink-model-triton}} and does
  not affect other Flink functionality. All new behavior is controlled by the
  new config options, which default to safe values; pooling can be turned off
  with {{connection-reuse-enabled=false}}.

  h2. Implementation

  Pull Request: [apache/flink#27568|https://github.com/apache/flink/pull/27568]
  ({{[FLINK-38857][models] Add HTTP connection pool management for Triton inference}})



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
