andygrove commented on PR #1777:
URL: 
https://github.com/apache/datafusion-ballista/pull/1777#issuecomment-4545633320

   Good call. Pushed `ballista/client/tests/multi_file_scan.rs` (commit 
b68f73e3) with two standalone-cluster regression tests that exercise multi-file 
parquet scans:
   
   - `multi_file_parquet_scan_counts_every_row_exactly_once`: writes 6 parquet 
files, 7 rows each (42 rows total), and asserts that `SELECT COUNT(*), SUM(value)` 
returns 42 and `sum(0..42)`.
   - `multi_file_parquet_group_by_returns_each_value_once`: runs `GROUP BY value` 
after the scan and asserts that every key shows up exactly once.
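   
   For reference, the expected values in the first test can be worked out 
without a cluster. This is only a sketch: it assumes file `i` holds the values 
`i*7 .. (i+1)*7`, which is a hypothetical layout and may not match what the 
test actually writes.

```rust
// Hypothetical layout: file `i` holds the values i*7 .. (i+1)*7, so the six
// files together cover 0..42 with no overlap. The real test may differ.
const NUM_FILES: i64 = 6;
const ROWS_PER_FILE: i64 = 7;

/// Values stored in one (hypothetical) parquet file.
fn file_values(file_index: i64) -> Vec<i64> {
    let start = file_index * ROWS_PER_FILE;
    (start..start + ROWS_PER_FILE).collect()
}

/// (COUNT(*), SUM(value)) that a scan reading every row exactly once reports.
fn expected_count_and_sum() -> (i64, i64) {
    let all: Vec<i64> = (0..NUM_FILES).flat_map(file_values).collect();
    (all.len() as i64, all.iter().sum())
}

fn main() {
    let (count, sum) = expected_count_and_sum();
    println!("expected COUNT(*) = {count}, SUM(value) = {sum}");
}
```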
   
   Both tests fail under this branch, so I left them `#[ignore]`d: they 
document the failure mode without blocking CI. The first one counts 252 rows 
instead of 42 (= 6 tasks × 42 rows). The metrics on the leaf stage make the 
cause explicit: `files_opened=36, files_processed=36` for 6 input files, i.e. 
each of the 6 tasks opened all 6 files.
   
   Tracing it back to DataFusion 54: `FileScanConfig::create_sibling_state` now 
hands out a `SharedWorkSource` populated with every file in the scan, and 
`FileScanConfig::open_with_args` wires that shared queue into the partition's 
`FileStreamBuilder`. In a single-process DataFusion run that's safe — every 
partition of the same DataSourceExec instance shares one queue and they drain 
it cooperatively. Under Ballista each task deserialises its *own* copy of the 
plan, owns its own shared queue containing every file, and executes a single 
partition that drains the whole queue locally. So this isn't quite the same 
shape as the bug datafusion-distributed hit (`PartitionIsolatorExec` using 
`task_index` at execution), but the root cause — DF 54 no longer assuming 
pre-baked partition→files at execution — bites Ballista too.
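   
   A toy model of the difference, using only the standard library. Nothing 
here is DataFusion or Ballista API: `SharedQueue`, `drain` and the two 
`files_opened_*` functions are hypothetical names, and the partitions are run 
one after another rather than concurrently to keep the sketch short.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for the shared work queue described above.
type SharedQueue = Arc<Mutex<VecDeque<String>>>;

fn new_queue(files: &[String]) -> SharedQueue {
    Arc::new(Mutex::new(files.iter().cloned().collect()))
}

/// One partition pulls files until the queue is empty; returns how many it opened.
fn drain(queue: &SharedQueue) -> usize {
    let mut opened = 0;
    loop {
        let next = queue.lock().unwrap().pop_front();
        match next {
            Some(_file) => opened += 1,
            None => break,
        }
    }
    opened
}

/// Single process: all partitions share ONE queue, so each file is opened once.
fn files_opened_single_process(files: &[String], partitions: usize) -> usize {
    let queue = new_queue(files);
    (0..partitions).map(|_| drain(&queue)).sum()
}

/// One plan copy per task: every task builds its OWN full queue and drains it.
fn files_opened_one_copy_per_task(files: &[String], tasks: usize) -> usize {
    (0..tasks).map(|_| drain(&new_queue(files))).sum()
}

fn main() {
    let files: Vec<String> = (0..6).map(|i| format!("part-{i}.parquet")).collect();
    println!("shared queue:  {}", files_opened_single_process(&files, 6));
    println!("copy per task: {}", files_opened_one_copy_per_task(&files, 6));
}
```

   With 6 files and 6 tasks the second function opens 36 files, which lines up 
with the `files_opened=36` metric and, at 7 rows per file, with the 252 rows.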
   
   The likely fix has the same shape as datafusion-distributed PR #467: pre-split 
`FileScanConfig.file_groups` per task before serialising the plan, so each task 
ships with a single-partition config and the shared queue only contains that 
partition's files. I'd prefer to land that as a follow-up PR rather than expand 
this one; happy to file an issue and link the test if that works for you.
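   
   Roughly what the pre-split would look like, as a sketch over plain vectors. 
`FileGroups` is a hypothetical stand-in for `FileScanConfig.file_groups` and 
`split_per_task` is an illustrative name; the real change would operate on the 
actual plan types before serialisation.

```rust
// Hypothetical stand-in for `FileScanConfig.file_groups`:
// one inner Vec of file paths per partition.
type FileGroups = Vec<Vec<String>>;

/// For each partition, build a single-partition config holding only that
/// partition's files. Task `i` would be shipped `result[i]`.
fn split_per_task(file_groups: &FileGroups) -> Vec<FileGroups> {
    file_groups
        .iter()
        .map(|group| vec![group.clone()])
        .collect()
}

fn main() {
    // Six partitions with one file each, mirroring the regression test.
    let groups: FileGroups = (0..6)
        .map(|i| vec![format!("part-{i}.parquet")])
        .collect();

    for (task, config) in split_per_task(&groups).iter().enumerate() {
        // Each task's queue would now be seeded from `config` only.
        println!("task {task}: {config:?}");
    }
}
```

   Each task then sees exactly one group, so a queue built from its config can 
only contain that partition's files.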


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

