davisp commented on PR #16080:
URL: https://github.com/apache/datafusion/pull/16080#issuecomment-2993122679

   Just to repeat some color commentary from previous threads: I originally 
found this through one of the TPC-H queries (17 or 18, if memory serves). With 
statistics disabled, the query would OOM a 32GiB machine; with statistics 
collected, it used no more than 2GiB.
   
   In terms of regressions, the only thing I can imagine is some scenario where 
folks are running high-frequency queries while somehow reopening each datasource 
on every query, which I’d expect to be rare and to manifest as “huh, this 
pipeline is now 10% slower”.
   
   Based on my (admittedly new and shallow) experience, DataFusion seems 
geared more towards the first use case than the second, so the “obvious” 
default is to collect statistics.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

