NGA-TRAN commented on PR #19124: URL: https://github.com/apache/datafusion/pull/19124#issuecomment-3628729174
@asolimando

> However, I think we should also consider the impact of data skew. For heavily skewed tables, preserving file partitioning can make the scan itself significantly unbalanced (one or a few partition groups doing most of the I/O), and in those cases you might actually prefer to pay the shuffle cost rather than constrain execution to the file partition layout.

That’s a very good point. I think we can frame this as two distinct scenarios:

1. **Standard input data**: when the input consists of Parquet files (or similar formats) and users want DataFusion to handle skew, your suggestions fit perfectly.
2. **Customized input data**: when the input is specialized and DataFusion lacks sufficient statistics to manage skew, or when the data has its own structure and skew-handling logic. In these cases, users may want to preserve partitions as-is for performance or correctness, and we should avoid intervening.

Overall, I believe we should support both options: let DataFusion operate as a library while giving users the flexibility to decide how they want to handle their data.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
