comphead commented on PR #21351:
URL: https://github.com/apache/datafusion/pull/21351#issuecomment-4255129097

   I asked Claude to write some tests on Shared mode and it found some 
potential concerns
   
   ```
    When a partition hits its row limit, scan_state.rs:175 runs:                
                                                                                
                                                                                
                                                                     
                                                                                
                                                                                
                                                                                
                                                                    
     let done = 1 + self.work_source.len();                                     
                                                                                
                                                                                
                                                                      
     self.metrics.files_processed.add(done);                                    
                                                                                
                                                                                
                                                                      
                                                                                
                                                                                
                                                                                
                                                                      
     This was correct before work stealing — work_source.len() returned the 
count of files owned exclusively by this stream. Hitting the limit meant those 
files would never be opened, so marking them all as "processed" (skipped) was 
accurate.                                                                    
                                                                                
                                                                                
                                                                                
                                                                      
     With WorkSource::Shared, len() returns the count of files still in the 
shared queue. Those files aren't skipped — sibling streams will pop and process 
them. So they get counted twice:                                                
                                                                          
                                                               
     1. The limiting partition adds them to files_processed (treating them as 
skipped)                                                                        
                                                                                
                                                                        
     2. The sibling that actually processes them adds 1 each as it finishes them
                                                                                
                                                                                
                                                                                
                                                                      
     In the test: 4 total files, partition 1 hits the limit and counts 1 + 3 = 
4, then partition 0 processes the remaining 3. Result: files_processed = 7 
instead of 4.                                                                   
                                                                            
                                                                                
                                                                                
                                                                                
                                                                      
     The fix: when hitting a limit with a shared work source, only count the 
current file (done = 1), not the shared queue remainder.  
                                                                                
                                                                                
                                                                                
                                                                      
     The fix would be in scan_state.rs:175 — for the WorkSource::Shared case, 
done should be 1 (only the current file), not 1 + shared_queue.len().     
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to