Mryange opened a new pull request, #65140:
URL: https://github.com/apache/doris/pull/65140

   ### What problem does this PR solve?
   
   We are investigating an occasional query timeout in which a pipeline task 
stayed in the RUNNABLE state for almost 900s. The task's pipeline consists of a 
local exchange source, a cross join, and a sort sink. The current evidence 
suggests the task is not blocked waiting for LocalExchange input: all local 
exchange sink operators have finished (`_running_sink_operators: 0`), the 
source queue still has data (`size approx = 7`), and all dependencies of the 
runnable task are ready. We suspect the cross join gets stuck or makes no 
progress when it receives an oversized block from an adaptive passthrough local 
exchange.
   
   Relevant debug info from the timeout log:
   
   ```text
   PipelineFragmentContext Info: _closed_tasks=22, _total_tasks=24, 
need_notify_close=false, fragment_id=10, _rec_cte_stage=0
   Task 0: QueryId: 160c012c0d7e49b0-bb5252d26339a281
   InstanceId: 160c012c0d7e49b0-bb5252d26339a2af
   PipelineTask[id = 0, open = true, eos = false, state = BLOCKED, dry run = 
false, _wake_up_early = false, _wake_up_by = -1, time elapsed since last state 
changing = 899s, spilling = false, is running = false] elapse time = 899s, 
block dependency = [LOCAL_MERGE_SORT_SOURCE_OPERATOR_DEPENDENCY: id=16, block 
task = 1, ready=false, _always_ready=false]
   LOCAL_MERGE_SORT_SOURCE_OPERATOR: id=16, parallel_tasks=3, 
_is_serial_operator=false
   
   Task 1: QueryId: 160c012c0d7e49b0-bb5252d26339a281
   InstanceId: 160c012c0d7e49b0-bb5252d26339a2af
   PipelineTask[id = 1, open = true, eos = false, state = RUNNABLE, dry run = 
false, _wake_up_early = false, _wake_up_by = -1, time elapsed since last state 
changing = 895s, spilling = false, is running = true] elapse time = 899s, block 
dependency = [NULL]
   LOCAL_EXCHANGE_OPERATOR(ADAPTIVE_PASSTHROUGH): id=20, parallel_tasks=3, 
_is_serial_operator=false, _channel_id: 0, _num_partitions: 3, _num_senders: 3, 
_num_sources: 3, _running_sink_operators: 0, _running_source_operators: 1, 
mem_usage: 1205248, data queue info: Data Queue 0: [size approx = 7, eos = 
false], MemTrackers: 0: 1205248, 1: 1203200, 2: 1205248,
     CROSS_JOIN_OPERATOR: id=15, parallel_tasks=3, _is_serial_operator=false
       SORT_SINK_OPERATOR: id=16, _is_serial_operator=false
   
   Read Dependency Information:
   0.   LOCAL_EXCHANGE_OPERATOR_DEPENDENCY: id=-3, block task = 0, ready=true, 
_always_ready=true
   1.     CROSS_JOIN_OPERATOR_DEPENDENCY: id=15, block task = 0, ready=true, 
_always_ready=false
   3.     MemorySufficientDependency: id=-1, block task = 0, ready=true, 
_always_ready=true
   
   Write Dependency Information:
   3.   SORT_SINK_OPERATOR_DEPENDENCY: id=16, block task = 0, ready=true, 
_always_ready=false
   ```
   
   This change adds defensive diagnostics to the common operator output and 
sink input paths to assert that block rows do not exceed `batch_size`. It also 
extends the nested loop join probe debug output with the current child block 
rows, join block rows, probe/build cursor positions, build block row counts, 
and whether the operator is using the build-base generation path. With these 
details, the next timeout dump should show whether the cross join is holding an 
oversized probe/build block and whether it is in a no-progress state.
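
The two diagnostics described above can be pictured with the minimal, standalone sketch below. It is not the code in this PR. `BlockStub`, `ProbeStateStub`, `check_block_rows`, and `debug_string` are hypothetical names invented for illustration, and the field names are assumptions that only mirror the quantities listed in the paragraph above.

```cpp
#include <cassert>  // used by the usage example in the note below
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a data block; only the row count matters here.
struct BlockStub {
    std::size_t rows = 0;
};

// Returns a diagnostic message when the block has more rows than batch_size,
// or std::nullopt when the block is within the limit. `where` names the call
// site (for example an operator output or a sink input path).
std::optional<std::string> check_block_rows(const BlockStub& block,
                                            std::size_t batch_size,
                                            const std::string& where) {
    if (block.rows <= batch_size) {
        return std::nullopt;
    }
    std::ostringstream msg;
    msg << where << ": block rows " << block.rows
        << " exceed batch_size " << batch_size;
    return msg.str();
}

// Hypothetical snapshot of nested loop join probe state. The fields are
// assumptions modeled on the quantities the PR description lists.
struct ProbeStateStub {
    std::size_t child_block_rows = 0;         // rows in the current probe-side block
    std::size_t join_block_rows = 0;          // rows accumulated in the output block
    std::size_t probe_cursor = 0;             // position in the probe block
    std::size_t build_cursor = 0;             // position in the build blocks
    std::vector<std::size_t> build_block_rows; // row count of each build block
    bool use_build_base_generation = false;   // which generation path is active
};

// Formats the snapshot as a single line suitable for a timeout dump.
std::string debug_string(const ProbeStateStub& s) {
    std::ostringstream out;
    out << "child_block_rows=" << s.child_block_rows
        << ", join_block_rows=" << s.join_block_rows
        << ", probe_cursor=" << s.probe_cursor
        << ", build_cursor=" << s.build_cursor
        << ", build_block_rows=[";
    for (std::size_t i = 0; i < s.build_block_rows.size(); ++i) {
        if (i > 0) {
            out << ", ";
        }
        out << s.build_block_rows[i];
    }
    out << "], use_build_base_generation="
        << (s.use_build_base_generation ? "true" : "false");
    return out.str();
}
```

With this sketch, `check_block_rows(BlockStub{8192}, 4096, "sink")` yields a message, while a 4096-row block yields `std::nullopt`, so a caller could write `assert(!check_block_rows(block, batch_size, "sink"))` in debug builds. If the cursor values in the `debug_string` output are identical across two consecutive dumps, that would point to a no-progress state.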
   
   ### Release note
   
   None
   
   ### Check List (For Author)
   
   - Test <!-- At least one of them must be included. -->
       - [ ] Regression test
       - [ ] Unit Test
       - [ ] Manual test (add detailed scripts or steps below)
       - [ ] No need to test or manual test. Explain why:
           - [ ] This is a refactor/code format and no logic has been changed.
           - [ ] Previous test can cover this change.
           - [ ] No code files have been changed.
           - [ ] Other reason <!-- Add your reason?  -->
   
   - Behavior changed:
       - [ ] No.
       - [ ] Yes. <!-- Explain the behavior change -->
   
   - Does this need documentation?
       - [ ] No.
       - [ ] Yes. <!-- Add document PR link here. eg: 
https://github.com/apache/doris-website/pull/1214 -->
   
   ### Check List (For Reviewer who merge this PR)
   
   - [ ] Confirm the release note
   - [ ] Confirm test cases
   - [ ] Confirm document
   - [ ] Add branch pick label <!-- Add branch pick label that this PR should 
merge into -->
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

