morningman commented on code in PR #64799:
URL: https://github.com/apache/doris/pull/64799#discussion_r3487272034
##########
fe/fe-core/src/main/java/org/apache/doris/service/arrowflight/DorisFlightSqlProducer.java:
##########
@@ -189,6 +189,10 @@ private FlightInfo executeQueryStatement(String
peerIdentity, ConnectContext con
try {
Preconditions.checkState(null != connectContext);
Preconditions.checkState(!query.isEmpty());
+ // Finalize the previous query's coordinator on this connection
whose close was
+ // deferred (Arrow Flight keeps it alive across GetFlightInfo ->
DoGet so the BE can
+ // fetch external-table splits during DoGet). By now the previous
DoGet is done. #62259
+ connectContext.closeFlightSqlDeferredExecutors();
// After the previous query was executed, there was no
getStreamStatement to take away the result.
Review Comment:
I looked into this and don't think it's a regression this PR introduces, nor
something that can be properly fixed within this PR's scope. Details:
**Not a new failure mode / not a regression.** Before this PR, *every*
external-table batch-split query over Arrow Flight already failed with `Split
source X is released`, because the coordinator was closed at the end of
`GetFlightInfo` — sequential queries included (that is exactly #62259). After
this PR the sequential case works; the interleaved case you describe fails as
it did before. So nothing is "reintroduced" — interleaving never worked.
**Interleaving two queries on one session is not a supported usage.** A
bearer token maps to a single shared `ConnectContext`
(`FlightSessionsWithTokenManager`), and that context is
single-query-by-construction: each `GetFlightInfo` resets the shared
`FlightSqlChannel`, clears the single `flightSqlEndpointsLocations` list, and
reuses one `queryId`. Two concurrent/interleaved queries on one session corrupt
all of that, not just the deferred coordinator — the same way sharing one JDBC
`Connection` across threads is unsafe. The supported pattern (drain the result,
then issue the next query) makes "the previous DoGet is done by the time the
next query starts" hold, since the BE has consumed the SplitSource by the time
the client reaches end-of-stream.
**The suggested fix needs a signal the FE does not have.** For an
external-table scan the result `DoGet` endpoint points directly at the BE; the
FE is not in that data path and gets no completion signal when the BE-side
stream finishes. "Keep the coordinator until the result stream completes" or
"serialize/reject a new statement while a result is live" both require a new BE
-> FE completion notification (or session-level serialization keyed on it),
which is a separate architectural change out of scope here.
**Leaks are already bounded.** The "or the session is closed" fallback is
implemented: `FlightSqlConnectPoolMgr.unregisterConnection()` ->
`closeFlightSqlDeferredExecutors()` covers idle/query timeout, bearer-token
expiry and explicit `CloseSession`. And the BE-side hardening in #64797 makes a
stale `fetchSplitBatch` fail gracefully instead of crashing the BE.
If first-class concurrent Flight clients are ever needed, the right
follow-up is a BE -> FE completion signal plus session serialization; I can
track that separately.
##########
fe/fe-core/src/main/java/org/apache/doris/service/arrowflight/FlightSqlConnectProcessor.java:
##########
@@ -196,11 +196,20 @@ public void fetchArrowFlightSchema(int timeoutMs) {
@Override
public void close() throws Exception {
ctx.setCommand(MysqlCommand.COM_SLEEP);
+ // Executors whose results are pulled from the BE keep their
coordinator alive past
+ // GetFlightInfo (registered as deferred executors on the
ConnectContext) so the BE can
+ // still fetch external-table splits during DoGet. Do NOT finalize
those here; they are
+ // finalized when the next query starts or the connection is torn
down. Executors that are
+ // not deferred (local results, or a query that already failed) are
finalized now. See #62259.
for (StmtExecutor asynExecutor : returnResultFromRemoteExecutor) {
- asynExecutor.finalizeQuery();
+ if (!asynExecutor.isDeferredForArrowFlight()) {
+ asynExecutor.finalizeQuery();
+ }
Review Comment:
Confirmed — this is a real (bounded) leak. If the Arrow schema fetch in
`executeQueryStatement` fails after the coordinator was already deferred during
planning, `FlightSqlConnectProcessor.close()` skips the deferred executor and
the outer catch only rethrows, so the coordinator (its external-table batch
SplitSource, the query queue slot and the query registration) stays alive until
the next query starts or the connection is torn down — even though no `DoGet`
will ever pull this query's results.
Fixed in 41a79a659df by finalizing the deferred coordinator on the
`GetFlightInfo` error path: `executeQueryStatement`'s catch now calls
`connectContext.closeFlightSqlDeferredExecutors()`. At that point the list
holds only this failed query's coordinator (the previous one was already
finalized at the top of the method), and it covers every post-deferral failure,
not just the schema fetch.
Added
`DorisFlightSqlProducerTest.testGetFlightInfoFinalizesDeferredExecutorWhenSchemaFetchFails`,
which defers a coordinator then fails the schema fetch and asserts the
deferred executor is finalized (verified to fail without the fix).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]