NGA-TRAN commented on PR #19124:
URL: https://github.com/apache/datafusion/pull/19124#issuecomment-3628729174

   @asolimando 
   
   > However, I think we should also consider the impact of data skew. For 
heavily skewed tables, preserving file partitioning can make the scan itself 
significantly unbalanced (one or a few partition groups doing most of the I/O), 
and in those cases you might actually prefer to pay the shuffle cost rather 
than constrain execution to the file partition layout.
   
   That’s a very good point. I think we can frame this as two distinct 
scenarios:
   
   1. **Standard input data** — when the input consists of Parquet files (and 
similar formats) and users want DataFusion to handle skew, your suggestions fit 
perfectly.
   2. **Customized input data** — when the input is specialized and DataFusion 
lacks sufficient statistics to manage skew, or when the data has its own 
structure and skew‑handling logic. In these cases, users may want to preserve 
partitions as‑is for performance or correctness, and we should avoid 
intervening.
   
   Overall, I believe we should support both options: let DataFusion operate as 
a library while giving users the flexibility to decide how they want to handle 
their data.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


