Re: [PR] Make `SessionContext::register_parquet` obey `collect_statistics` config [datafusion]

via GitHub Fri, 20 Jun 2025 16:38:10 -0700


davisp commented on PR #16080:
URL: https://github.com/apache/datafusion/pull/16080#issuecomment-2993122679


   Just to repeat some color commentary from previous threads: I originally 
found this through one of the TPC-H queries (17 or 18, if memory serves). It 
would OOM a 32GiB machine when run through the code path that had statistics 
disabled, versus using no more than 2GiB when statistics were collected.
   
   In terms of regressions, the only thing I can imagine is some scenario where 
folks are running high-frequency queries while somehow reopening each datasource 
on every query, which I’d expect to be rare and to manifest as “huh, this 
pipeline is now 10% slower”.
   
   Based on my (admittedly new and shallow) experience, DataFusion seems 
geared more towards the first use case than the second, so the “obvious” 
default is to collect statistics.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

