andygrove opened a new pull request, #3947: URL: https://github.com/apache/datafusion-comet/pull/3947
## Which issue does this PR close?

Closes #3817.

## Rationale for this change

When Comet uses `native_datafusion` scan mode, DataFusion's built-in `prune_by_range` uses a different algorithm than Spark/parquet-mr to assign row groups to file splits:

- **Spark/parquet-mr/parquet-rs**: Uses the **midpoint** of a row group (`start_offset + compressed_size / 2`) to determine ownership. A row group belongs to a split if its midpoint falls within `[split_start, split_end)`.
- **DataFusion**: Uses the **start offset** (`column(0).dictionary_page_offset` or `data_page_offset`). A row group belongs to a split if its start offset falls within the range.

When these algorithms disagree (e.g., a row group starts before a split boundary but its midpoint is after it), some tasks end up reading too many row groups while others read none. This wastes cluster parallelism: in the reported case, 600 out of 1800 tasks were idle.

## What changes are included in this PR?

Two new functions in `native/core/src/parquet/parquet_exec.rs`:

- `get_row_group_midpoint(rg)`: Computes the midpoint offset of a row group using the same algorithm as Spark/parquet-mr and parquet-rs.
- `apply_midpoint_row_group_pruning(file_groups, store)`: For each `PartitionedFile` with a byte range, reads the Parquet footer, computes which row groups have their midpoint within the range, creates a `ParquetAccessPlan` with those row groups, and removes the byte range. This causes DataFusion to use the explicit access plan and skip its built-in `prune_by_range`.

The function is called in `init_datasource_exec` and short-circuits early if no files have ranges (no overhead for non-split files).

Note: this is a Comet-side workaround. The upstream fix would be to change DataFusion's `prune_by_range` to use midpoint-based assignment.

## How are these changes tested?

This needs testing with splittable Parquet files on a cluster (HDFS) where files are large enough to be split into multiple tasks.
The issue could not be reproduced locally with local filesystem. Existing test suites verify no regression for the common case where files are not split.
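
To illustrate how the two assignment rules described in the rationale can disagree, here is a minimal, self-contained sketch. It is not the code from this PR: the `RowGroup` struct and the `select_by_midpoint` / `select_by_start` helpers are hypothetical stand-ins for the real Parquet metadata types, and the offsets and sizes are made-up numbers chosen only to show the effect.

```rust
/// Hypothetical, simplified stand-in for Parquet row group metadata.
/// The real code reads these values from the Parquet footer.
#[derive(Debug, Clone, Copy)]
struct RowGroup {
    /// Byte offset in the file where the row group starts.
    start_offset: u64,
    /// Total compressed size of the row group in bytes.
    compressed_size: u64,
}

/// Midpoint rule: start_offset + compressed_size / 2.
fn midpoint(rg: &RowGroup) -> u64 {
    rg.start_offset + rg.compressed_size / 2
}

/// Indices of row groups whose midpoint lies in [split_start, split_end).
fn select_by_midpoint(row_groups: &[RowGroup], split_start: u64, split_end: u64) -> Vec<usize> {
    let mut selected = Vec::new();
    for (i, rg) in row_groups.iter().enumerate() {
        let m = midpoint(rg);
        if m >= split_start && m < split_end {
            selected.push(i);
        }
    }
    selected
}

/// Indices of row groups whose start offset lies in [split_start, split_end).
fn select_by_start(row_groups: &[RowGroup], split_start: u64, split_end: u64) -> Vec<usize> {
    let mut selected = Vec::new();
    for (i, rg) in row_groups.iter().enumerate() {
        if rg.start_offset >= split_start && rg.start_offset < split_end {
            selected.push(i);
        }
    }
    selected
}

/// Illustrative layout: three 100-byte row groups after a 4-byte header.
fn example_row_groups() -> [RowGroup; 3] {
    [
        RowGroup { start_offset: 4, compressed_size: 100 },
        RowGroup { start_offset: 104, compressed_size: 100 },
        RowGroup { start_offset: 204, compressed_size: 100 },
    ]
}
```

With this illustrative layout and splits `[0, 128)` and `[128, 256)`, the midpoint rule assigns row group 0 to the first split and row groups 1 and 2 to the second, while the start-offset rule assigns row groups 0 and 1 to the first split and only row group 2 to the second. If the planner sized splits assuming one rule and the reader applies the other, some tasks receive extra row groups and others receive none.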
