davisp commented on PR #16080: URL: https://github.com/apache/datafusion/pull/16080#issuecomment-2993122679
To repeat some color commentary from previous threads: I originally found this through one of the TPC-H queries (17 or 18, if memory serves). With statistics disabled, the query would OOM a 32GiB machine; with statistics collected, it used no more than 2GiB.

In terms of regressions, the only scenario I can imagine is folks running high-frequency queries while somehow reopening each datasource on every query. I’d expect that to be rare and to manifest as “huh, this pipeline is now 10% slower”.

Based on my (admittedly new and shallow) experience, DataFusion feels geared more towards the first use case than the second, so the “obvious” default is to collect statistics.