morningman opened a new pull request, #64797:
URL: https://github.com/apache/doris/pull/64797

   ### What problem does this PR solve?
   
   Issue Number: ref #62259 (partial — robustness layer only, does not close 
the issue)
   
   Related PR: N/A
   
   Problem Summary:
   
   When an Arrow Flight SQL query scans an external table (e.g. Iceberg) in 
batch split mode, the BE fetches file splits from the FE via the 
`fetchSplitBatch` thrift RPC *during* the scan. If that fetch fails — most 
notably when the split source has already been released — the error path could 
crash the BE (SIGSEGV in 
`arrow::flight::internal::TransportStatus::FromStatus`) instead of failing the 
query gracefully (see #62259).
   
   This PR is the **robustness layer** for that issue: it ensures any 
`fetchSplitBatch` failure makes the query fail gracefully rather than crashing 
the BE. It does **not** fix the underlying split source lifecycle problem (the 
source being released after `GetFlightInfo` but before `DoGet` on the Arrow 
Flight two-phase path), which is tracked separately in the issue.
   
   Changes:
   
   1. **BE `split_source_connector`** — guard `result.status.error_msgs[0]` 
with an `empty()` check to avoid an out-of-bounds vector read when the FE 
returns a non-OK status without an error message.
   2. **BE `to_arrow_status`** — truncate the error message handed to 
Arrow/gRPC to a length well below 8192. The message is carried in the gRPC 
trailer (an HTTP2 header) and may be percent-encoded, so an oversized one can 
break the response or crash the flight transport status conversion. The 8192 
limit was already documented in a comment in this function but was never 
enforced. The full message is still logged on the BE.
   3. **FE `fetchSplitBatch`** — when the split source has been released, 
return a structured `TStatus(NOT_FOUND)` with a message instead of throwing a 
bare `TException`. The BE then receives a well-formed, non-empty error through 
the normal result path instead of a thrift transport exception.
   
   ### Release note
   
   Fix a BE crash (SIGSEGV) that could happen on the error path of Arrow Flight 
SQL queries against external tables when fetching split batches from the FE 
fails.
   
   ### Check List (For Author)
   
   - Test
       - [ ] Regression test
       - [ ] Unit Test
       - [ ] Manual test (add detailed scripts or steps below)
       - [x] No need to test or manual test. Explain why:
           - [ ] This is a refactor/code format and no logic has been changed.
           - [ ] Previous test can cover this change.
           - [ ] No code files have been changed.
           - [x] Other reason: This is defensive hardening of an error path (an 
out-of-bounds guard, a message-length cap, and returning a structured error 
status instead of throwing). It only triggers when `fetchSplitBatch` fails; 
existing tests cover the success path, and the crash depends on Arrow/gRPC 
transport internals that are hard to reproduce deterministically in CI.
   
   - Behavior changed:
       - [x] Yes. On `fetchSplitBatch` failure the query now fails with a clear 
error message instead of (potentially) crashing the BE, and the FE no longer 
throws a bare `TException` for a released split source (it returns a 
`NOT_FOUND` status).
   
   - Does this need documentation?
       - [x] No.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to