comphead commented on PR #21351: URL: https://github.com/apache/datafusion/pull/21351#issuecomment-4255129097
I asked Claude to write some tests on Shared mode and it found some
potential concerns
```
When a partition hits its row limit, scan_state.rs:175 runs:
let done = 1 + self.work_source.len();
self.metrics.files_processed.add(done);
This was correct before work stealing — work_source.len() returned the
count of files owned exclusively by this stream. Hitting the limit meant those
files would never be opened, so marking them all as "processed" (skipped) was
accurate.
With WorkSource::Shared, len() returns the count of files still in the
shared queue. Those files aren't skipped — sibling streams will pop and process
them. So they get counted twice:
1. The limiting partition adds them to files_processed (treating them as
skipped)
2. The sibling that actually processes them adds 1 each as it finishes them
In the test: 4 total files, partition 1 hits the limit and counts 1 + 3 =
4, then partition 0 processes the remaining 3. Result: files_processed = 7
instead of 4.
The fix: when hitting a limit with a shared work source, only count the
current file (done = 1), not the shared queue remainder.
The fix would be in scan_state.rs:175 — for the WorkSource::Shared case,
done should be 1 (only the current file), not 1 + shared_queue.len().
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
