comphead commented on issue #3817:
URL: https://github.com/apache/datafusion-comet/issues/3817#issuecomment-4193279630
Spark's split logic guarantees each row is counted exactly once. The `prune_by_range` check is a performance optimization, not a correctness gate: if a row group's start offset falls within an executor's assigned byte range, that executor reads it; otherwise it skips it. Since RG1 starts at ~122MB (below the 128MB split point), chunk 1 always claims it and chunk 2 never does.
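For illustration, here is a minimal sketch of that ownership rule (the `row_group_is_owned` name and signature are hypothetical, not DataFusion's actual API):
```rust
/// Hypothetical sketch of the ownership rule described above: an
/// executor claims a row group iff the row group's start offset falls
/// inside the executor's byte range [range_start, range_end).
fn row_group_is_owned(rg_start_offset: u64, range_start: u64, range_end: u64) -> bool {
    range_start <= rg_start_offset && rg_start_offset < range_end
}

fn main() {
    let mb: u64 = 1024 * 1024;
    // RG1 starts at ~122MB, below the 128MB split point, so chunk 1
    // (0..128MB) claims it and chunk 2 (128MB..190MB) never does.
    assert!(row_group_is_owned(122 * mb, 0, 128 * mb));
    assert!(!row_group_is_owned(122 * mb, 128 * mb, 190 * mb));
}
```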
The problem is performance waste:
```
┌──────────┬─────────────────────────────┬────────────────────────┐
│ │ Chunk 1 (0..128MB) │ Chunk 2 (128MB..190MB) │
├──────────┼─────────────────────────────┼────────────────────────┤
│ Intended │ Read RG0 (~273K rows) │ Read RG1 (~152K rows) │
├──────────┼─────────────────────────────┼────────────────────────┤
│ Actual │ Reads RG0 + RG1 (425K rows) │ Reads nothing (0 rows) │
└──────────┴─────────────────────────────┴────────────────────────┘
```
- Chunk 1 executors do ~1.5x the intended work (they read both row groups)
- The chunk 2 executor wastes ~800ms loading metadata only to return 0 rows
- Across 1800 partitions, this means hundreds of tasks doing zero useful work
Root cause: DataFusion's `prune_by_range` decides row group ownership by checking whether `col0_dict_page_offset` falls within `[range.start, range.end)`. Because both RGs start before the 128MB split point, chunk 1 gets both and chunk 2 gets none.
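To make the failure mode concrete, a small self-contained sketch (offsets taken from this example; the ownership check is the same hypothetical rule sketched above, inlined here) reproduces the "Actual" row of the table:
```rust
fn main() {
    let mb: u64 = 1024 * 1024;
    // Row group start offsets from this example: RG0 at 0, RG1 at ~122MB.
    let row_groups = [("RG0", 0u64), ("RG1", 122 * mb)];
    // The ~190MB file is split at the 128MB point into two chunks.
    let chunks = [("chunk 1", 0u64, 128 * mb), ("chunk 2", 128 * mb, 190 * mb)];
    for (chunk, start, end) in chunks {
        // A chunk claims every row group whose start offset lies in [start, end).
        let claimed: Vec<&str> = row_groups
            .iter()
            .filter(|(_, off)| start <= *off && *off < end)
            .map(|(name, _)| *name)
            .collect();
        println!("{chunk} reads {claimed:?}");
    }
    // Prints: chunk 1 reads ["RG0", "RG1"]; chunk 2 reads [].
}
```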