andygrove commented on issue #3817:
URL:
https://github.com/apache/datafusion-comet/issues/3817#issuecomment-4246263540
fwiw, Claude analysis of options to fix:
Options to fix
Option 1: Fix in DataFusion upstream (best long-term)
DataFusion's row group range pruning could be improved to assign each row
group to exactly one split, e.g., by checking if the row group's byte range
overlaps the split range rather than just checking the start
offset. This would be a contribution to the
https://github.com/apache/datafusion project.
Option 2: Override row group selection in Comet's ParquetSource
Comet already creates a custom ParquetSource. You could implement a custom
ParquetAccessPlan or row group filter that uses Spark's exact split boundaries
to decide ownership. DataFusion's ParquetSource supports
with_row_group_filter() — you could provide a filter that says "only read
row groups whose midpoint (or start of data) falls in my range," matching
Spark's assignment logic.
Option 3: Pre-split at the Spark level to align with row groups
In CometNativeScan serialization, before sending ranges to native, adjust
the ranges to align with Parquet row group boundaries. This would require
reading Parquet metadata on the JVM side (which Spark already does
for count() — explaining why count() works correctly).
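To make the range adjustment in Option 3 concrete, here is a minimal sketch of the alignment logic. It is written in Rust only for illustration; the real change would live on the JVM side. `ByteRange` and `align_split` are hypothetical names, not Comet or Spark APIs, and the midpoint ownership rule is an assumption taken from the "midpoint (or start of data)" wording in Option 2.

```rust
/// Half-open byte range [start, end) within a Parquet file.
/// Hypothetical type for illustration; not a Comet, Spark, or DataFusion type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ByteRange {
    pub start: u64,
    pub end: u64,
}

/// Shrink or grow `split` so that it covers exactly the row groups it owns.
///
/// Assumption: a split owns a row group when the row group's midpoint lies
/// inside the split. Returns `None` when the split owns no row group, in
/// which case the task has nothing to read.
pub fn align_split(row_groups: &[ByteRange], split: ByteRange) -> Option<ByteRange> {
    let mut aligned: Option<ByteRange> = None;
    for rg in row_groups {
        let mid = rg.start + (rg.end - rg.start) / 2;
        if mid >= split.start && mid < split.end {
            aligned = Some(match aligned {
                // First owned row group: start from its exact boundaries.
                None => *rg,
                // Later owned row groups: extend the aligned range to cover them.
                Some(a) => ByteRange {
                    start: a.start.min(rg.start),
                    end: a.end.max(rg.end),
                },
            });
        }
    }
    aligned
}
```

Because the aligned range starts and ends on row group boundaries, a start-offset check and a midpoint check on the native side would select the same row groups for it.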
Option 4: Post-filter on native side
After DataFusion reads row groups, add deduplication logic so that when a
row group spans two splits, only one split processes it. This is fragile but
doesn't require upstream changes.
Most practical path
Option 2 is probably the most practical near-term fix. DataFusion's
ParquetSource has hooks for customizing row group selection. You'd implement
Spark's exact row group assignment logic: a row group belongs to a
split if its reference offset (midpoint or start of data, as in Option 2)
falls within the half-open range [split.start, split.start + split.length).
Because contiguous splits never overlap, each row group is then owned by
exactly one split, each task reads exactly the row groups Spark intended
it to read, and no task ends up idle.
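The ownership rule can be sketched in plain Rust as follows. This is only an illustration under stated assumptions: `Split`, `RowGroup`, `Anchor`, and `owned_row_groups` are hypothetical names, not DataFusion or Comet APIs, and which anchor (start or midpoint) actually matches Spark would need to be confirmed before wiring this into a real row group filter.

```rust
/// One Spark file split, covering bytes [start, start + length).
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy)]
pub struct Split {
    pub start: u64,
    pub length: u64,
}

/// Simplified view of a Parquet row group (hypothetical, not the parquet crate's type).
#[derive(Debug, Clone, Copy)]
pub struct RowGroup {
    /// File offset where the row group's data begins.
    pub start: u64,
    /// Total compressed size of the row group in bytes.
    pub compressed_size: u64,
}

/// Which point of the row group decides ownership (assumption: one of these two).
#[derive(Debug, Clone, Copy)]
pub enum Anchor {
    Start,
    Midpoint,
}

impl RowGroup {
    fn anchor_offset(&self, anchor: Anchor) -> u64 {
        match anchor {
            Anchor::Start => self.start,
            Anchor::Midpoint => self.start + self.compressed_size / 2,
        }
    }
}

/// Return the indices of the row groups whose anchor offset falls inside `split`.
pub fn owned_row_groups(row_groups: &[RowGroup], split: Split, anchor: Anchor) -> Vec<usize> {
    let end = split.start + split.length;
    row_groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| {
            let off = rg.anchor_offset(anchor);
            off >= split.start && off < end
        })
        .map(|(i, _)| i)
        .collect()
}
```

With either anchor, every row group has a single anchor offset, so contiguous non-overlapping splits assign it to exactly one task. The two anchors can still assign the same row group to different splits, which is why the native side and Spark need to agree on one.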
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]