gene-bordegaray commented on issue #18595: URL: https://github.com/apache/datafusion/issues/18595#issuecomment-3512805842
> > * we are optimistic with CSV files and return true, since CSV returns no statistics. We could look at adding a better way to estimate whether a CSV file needs repartitioning at the file level
> > * We have exact statistics on Parquet files, so we do not round-robin repartition at the file level, but we still hash repartition later on in the query. We can look at how hash repartitioning is handled based on file statistics and improve this.
>
> I suggest we start with Parquet, since we already have precise statistics available. CSV is more complex and can be tackled separately once we’ve gathered the necessary stats.

OK, and yes, Parquet will be much less work. Should I create two separate issues for these?
