Re: [I] Eliminate Repartitioning for Small Datasets [datafusion]

via GitHub Mon, 10 Nov 2025 08:53:55 -0800


gene-bordegaray commented on issue #18595:
URL: https://github.com/apache/datafusion/issues/18595#issuecomment-3512805842


   > > * we are optimistic with CSV files and return true, since CSV returns no 
statistics. We could look at adding a better way to estimate whether a CSV file 
needs repartitioning at the file level
   > > * We have exact statistics on Parquet files, so we do not round-robin 
repartition at the file level, but we still hash repartition later on in the 
query. We can look at how hash repartitioning is handled based on file 
statistics and improve this.
   > 
   > I suggest we start with Parquet, since we already have precise statistics 
available. CSV is more complex and can be tackled separately once we’ve 
gathered the necessary stats.
   
   OK, and yes, Parquet will be much less work. Should I create two separate 
issues for these?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


