andygrove commented on issue #3817:
URL: https://github.com/apache/datafusion-comet/issues/3817#issuecomment-4246263540

   fwiw, Claude analysis of options to fix:
   
     Options to fix

     Option 1: Fix in DataFusion upstream (best long-term)
      
     DataFusion's row group range pruning could be improved to assign each
     row group to exactly one split, e.g., by checking if the row group's
     byte range overlaps the split range rather than just checking the start
     offset. This would be a contribution to the
     https://github.com/apache/datafusion project.
                                                            
     Option 2: Override row group selection in Comet's ParquetSource
      
     Comet already creates a custom ParquetSource. You could implement a
     custom ParquetAccessPlan or row group filter that uses Spark's exact
     split boundaries to decide ownership. DataFusion's ParquetSource
     supports with_row_group_filter(); you could provide a filter that says
     "only read row groups whose midpoint (or start of data) falls in my
     range," matching Spark's assignment logic.
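
     A minimal sketch of the ownership rule described above, in plain Python
     rather than Comet or DataFusion code. The RowGroup and Split types, the
     owns() helper and the byte offsets are all hypothetical, and which
     anchor (midpoint or start of data) matches Spark's behaviour is an
     assumption to verify against Spark itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroup:
    start: int  # byte offset where the row group's data begins (hypothetical)
    size: int   # total size of the row group in bytes (hypothetical)


@dataclass(frozen=True)
class Split:
    start: int   # first byte of the split
    length: int  # number of bytes covered by the split


def owns(split: Split, rg: RowGroup, use_midpoint: bool = True) -> bool:
    """True if the row group's single anchor byte lies in the half-open split range."""
    anchor = rg.start + rg.size // 2 if use_midpoint else rg.start
    return split.start <= anchor < split.start + split.length


def row_groups_for(split, row_groups, use_midpoint=True):
    """Indices of the row groups that this split should read."""
    return [i for i, rg in enumerate(row_groups) if owns(split, rg, use_midpoint)]


# Made-up layout: the second row group straddles the boundary at byte 150.
row_groups = [RowGroup(4, 96), RowGroup(100, 120), RowGroup(220, 80)]
splits = [Split(0, 150), Split(150, 150)]

assignment = [row_groups_for(s, row_groups) for s in splits]
print(assignment)
```

     Because each row group has exactly one anchor byte and the splits tile
     the file with half-open ranges, every row group ends up in one and only
     one split, whichever anchor is chosen.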
                                                            
     Option 3: Pre-split at the Spark level to align with row groups
      
     In CometNativeScan serialization, before sending ranges to native,
     adjust the ranges to align with Parquet row group boundaries. This
     would require reading Parquet metadata on the JVM side, which Spark
     already does for count() (explaining why count() works correctly).
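
     One way the range adjustment could look, sketched in Python for
     brevity (the real change would live in Comet's JVM code). The
     align_splits() helper and the offsets are hypothetical; ownership is
     decided by row group midpoint here, which is an assumption, and whether
     the native reader accepts a zero-length range is also unverified.

```python
def align_splits(splits, row_groups):
    """Snap each (start, length) split to the row groups it owns.

    splits:     list of (start, length) byte ranges that tile the file
    row_groups: list of (start, size) byte ranges from the Parquet footer
    Returns a new list of (start, length) ranges whose edges coincide with
    row group boundaries; a split that owns nothing gets length 0.
    """
    aligned = []
    for s_start, s_len in splits:
        s_end = s_start + s_len
        # A split owns a row group when the row group's midpoint falls in it.
        owned = [
            (rg_start, rg_start + rg_size)
            for rg_start, rg_size in row_groups
            if s_start <= rg_start + rg_size // 2 < s_end
        ]
        if owned:
            lo = min(begin for begin, _ in owned)
            hi = max(end for _, end in owned)
            aligned.append((lo, hi - lo))
        else:
            aligned.append((s_start, 0))
    return aligned


# Made-up layout: the second row group straddles the boundary at byte 150.
row_groups = [(4, 96), (100, 120), (220, 80)]
aligned = align_splits([(0, 150), (150, 150)], row_groups)
print(aligned)
```

     After alignment no row group crosses a range edge, so a start-offset
     check on the native side selects the same row groups for each range as
     the midpoint rule did.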
                                                            
     Option 4: Post-filter on native side                                       
   
     After DataFusion reads row groups, add deduplication logic so that when
     a row group spans two splits, only one split processes it. This is
     fragile but doesn't require upstream changes.
      
     Most practical path
                                                                                
     Option 2 is probably the most practical near-term fix. DataFusion's
     ParquetSource has hooks for customizing row group selection. You'd
     implement Spark's exact row group assignment logic: a row group belongs
     to a split if its offset falls within
     [split.start, split.start + split.length). This way, each task reads
     exactly the row groups Spark intended it to read, and no task ends up
     idle.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

