rdettai edited a comment on pull request #1347:
URL: https://github.com/apache/arrow-datafusion/pull/1347#issuecomment-975307896


   Thanks for spotting this @houqp. I plan on working on the default values 
soon, as I find them a bit confusing. They are sourced from multiple 
places and are thus hard to document and understand. I don't mind switching 
the default for `collect_stat` back to true. I tried not to change the default 
behaviors, but it seems this one slipped through.
   
   > For serious production use-cases, collecting stats should make a big 
difference in performance for cases where it could help
   
   Sadly, things are not that simple 😅. There are cases where this wouldn't 
be true. The `collect_stats` parameter you are referring to here defines 
whether statistics should be fetched **during planning**. This is meant 
primarily to enable cost-based optimizations. In a distributed setup like 
Ballista, planning happens on the scheduler node, so activating statistics 
fetching means a single node has to open every file/object to read the 
stats. This will take a long time if there are many files. It might actually 
be faster to just list the files and distribute the work across nodes 
(especially if you have many nodes). If what you want is not necessarily to 
get the statistics during planning, but only to have them when executing the 
`ExecutionPlan` to enable row group pruning, I think that should be a 
separate configuration (or maybe we don't even need a configuration: for 
formats like Parquet we should **always** get the file/row_group level 
statistics and use them).
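
   To make the distinction concrete, here is a minimal sketch of what two 
separate switches could look like. Everything in it is hypothetical: 
`StatsConfig`, its fields, and both helper functions are made up for 
illustration and are not part of DataFusion or Ballista. It only counts how 
many file footers each kind of node would have to open, assuming files are 
split evenly across executors.

```rust
/// Hypothetical configuration (NOT an existing DataFusion/Ballista type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct StatsConfig {
    /// Fetch statistics on the planning node, e.g. for cost-based optimizations.
    collect_at_planning: bool,
    /// Read file/row-group statistics on the executors to prune row groups.
    prune_at_execution: bool,
}

/// Number of file footers the single planning node has to open.
fn footers_opened_by_planner(cfg: StatsConfig, num_files: usize) -> usize {
    if cfg.collect_at_planning {
        num_files
    } else {
        0
    }
}

/// Upper bound on the footers opened by any one executor,
/// assuming files are distributed evenly across executors.
fn footers_opened_per_executor(
    cfg: StatsConfig,
    num_files: usize,
    num_executors: usize,
) -> usize {
    if !cfg.prune_at_execution || num_executors == 0 {
        return 0;
    }
    // Ceiling division: the busiest executor gets at most this many files.
    (num_files + num_executors - 1) / num_executors
}

fn main() {
    let num_files = 1000;
    let num_executors = 8;

    // Planning-time stats: the scheduler alone opens every file.
    let planning = StatsConfig { collect_at_planning: true, prune_at_execution: true };
    // Execution-time pruning only: the scheduler just lists the files.
    let pruning_only = StatsConfig { collect_at_planning: false, prune_at_execution: true };

    for cfg in [planning, pruning_only] {
        println!(
            "{:?}: planner opens {}, each executor opens at most {}",
            cfg,
            footers_opened_by_planner(cfg, num_files),
            footers_opened_per_executor(cfg, num_files, num_executors),
        );
    }
}
```

   With the second configuration the scheduler opens no files at all, while 
row group pruning on the executors is still possible.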

