SubhamSinghal opened a new pull request, #1796:
URL: https://github.com/apache/datafusion-ballista/pull/1796

    # Which issue does this PR close?
   
   # Rationale for this change
   
     Today every query against a Ballista cluster opens a brand-new 
TCP+TLS+HTTP/2 connection to the scheduler, runs
     `execute_query` plus a polling loop of `get_job_status`, and discards the 
connection. Even when the same
     `SessionContext` runs many queries against one scheduler URL, the 
handshake is paid on every query.
   
     `tonic::transport::Channel` is `Clone` and HTTP/2 multiplexes concurrent 
RPCs over one connection by design — so
     the per-query reconnect is throwing away exactly the thing meant to be 
shared. For interactive REPLs and JDBC
     sessions, this is a constant per-query latency floor (and worse on 
TLS-enabled deployments).
   
     Two `TODO reuse the scheduler to avoid connecting to the Ballista 
scheduler again and again` markers flag this in
      the source today:
   
     - `ballista/core/src/execution_plans/distributed_query.rs:354` 
(`execute_query_pull`)
     - `ballista/core/src/execution_plans/distributed_query.rs:527` 
(`execute_query_push`)
   
     # What changes are included in this PR?
   
     Adds a per-planner, lazily-initialized, idle-evicting, single-flight 
`SchedulerChannelCache` that holds one
     `tonic::transport::Channel` per `BallistaQueryPlanner`. The cache is 
constructed once per planner (≈ once per
     `SessionContext`) and cloned into every `DistributedQueryExec` it builds, 
so all queries on a session share one
     HTTP/2 connection.
   
     # Are these changes tested?
   
     Yes:
   
     - 9 unit tests on `SchedulerChannelCache`: lazy init, single-flight under 
16 concurrent callers, 
     build-failure-doesn't-poison, `invalidate()` swap semantics, idle eviction 
with and without an in-flight build, 
     drop-of-cache exits the eviction task (proving the eviction task holds 
only a `Weak`), and the 
     `is_transport_error` predicate.
     - 1 unit test in `BallistaQueryPlanner` asserting that two 
`create_physical_plan` calls produce exec nodes whose 
     caches share the same `Arc<Inner>` — the invariant that makes the feature 
actually work end-to-end.
     - All pre-existing `ballista-core` tests pass unchanged (108 / 108 lib 
tests green).
   
     # Are there any user-facing changes?
   
     One new config option:
   
     - `ballista.client.scheduler_channel_idle_timeout_seconds` — idle timeout 
in seconds for the client-side cached 
     scheduler gRPC channel. After this many seconds with no query activity, 
the underlying TCP connection is closed 
     and the next query reconnects. Default `300`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to