rdettai edited a comment on pull request #1347: URL: https://github.com/apache/arrow-datafusion/pull/1347#issuecomment-975307896
Thanks for spotting this @houqp. I plan to work on the default values soon, as I find them a bit confusing. They are sourced from multiple places and are thus hard to document and understand. I don't mind switching the default for `collect_stat` back to true. I tried not to change the default behaviors, but it seems that this one slipped through.

> For serious production use-cases, collecting stats should make a big different in performance for cases where it could help

Sadly things are not that simple 😅. There are some cases where this wouldn't be true. The `collect_stats` parameter you are referring to here decides whether statistics should be fetched **during planning**. This is meant primarily to enable cost-based optimizations. In a distributed setup like Ballista, planning happens on the scheduler node, so activating statistics fetching implies that a single node needs to open all the files/objects to read the stats. This will take a very long time if there are many files. It might actually be faster to just get the list of files and distribute the work across nodes (especially if you have many nodes).

If what you want is not necessarily to get the statistics during planning, but only to have them when executing the `ExecutionPlan` to enable row group pruning, I think that should be a separate configuration (or maybe we don't even need a configuration: for formats like Parquet we should **always** get and use the statistics).
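To make the trade-off above concrete, here is a rough back-of-the-envelope sketch. It is a toy cost model only: `Cluster`, `planning_time_fetch_ms`, `execution_time_fetch_ms`, and all the numbers are hypothetical and are not part of DataFusion or Ballista. It also assumes the scheduler opens files one after another, which is a simplification.

```rust
/// Toy cost model. All names and numbers here are hypothetical.
#[derive(Debug, Clone, Copy)]
struct Cluster {
    /// Number of worker nodes available at execution time.
    workers: u64,
    /// Assumed time to open one file and read its statistics, in milliseconds.
    stat_fetch_ms: u64,
}

/// Statistics fetched during planning: the single scheduler node opens every
/// file itself (assumed here to happen sequentially).
fn planning_time_fetch_ms(cluster: Cluster, num_files: u64) -> u64 {
    num_files * cluster.stat_fetch_ms
}

/// Statistics fetched during execution: files are spread across the workers,
/// and each worker only opens the files assigned to it.
fn execution_time_fetch_ms(cluster: Cluster, num_files: u64) -> u64 {
    let workers = cluster.workers.max(1);
    // Ceiling division: the busiest worker determines the elapsed time.
    let files_per_worker = (num_files + workers - 1) / workers;
    files_per_worker * cluster.stat_fetch_ms
}

fn main() {
    let cluster = Cluster { workers: 20, stat_fetch_ms: 50 };
    for num_files in [10u64, 1_000, 100_000] {
        println!(
            "{} files: scheduler-only = {} ms, distributed = {} ms",
            num_files,
            planning_time_fetch_ms(cluster, num_files),
            execution_time_fetch_ms(cluster, num_files),
        );
    }
}
```

Under these assumptions the gap grows linearly with the number of files, which is why fetching during planning may hurt on large listings even though the statistics themselves are useful.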