adriangb commented on PR #22343: URL: https://github.com/apache/datafusion/pull/22343#issuecomment-4594117921
The cost based optimizer discussion is interesting, but I worry that for analytical systems runtime adaptivity is mort impactful. Cost based optimizers are only as good as your estimates, which are going to be hard to make good if we don't have persistent stats, users write custom UDFs and increasingly AI is writing crazy complex queries. Cost based optimizers are essential for transactional systems where the average rows per query may be 1, but for the analytical workloads that people tend to run with DataFusion the average rows processed is in the tens if not hundreds of thousands (otherwise batch sizes of 8k and row groups of 1M would make no sense). This gives us an opportunity: if a 1s query is acceptable then running for 100ms / a small fraction of the query in a sub-optimal way while we gather stats but then making the other 900ms take 500ms is a big win. We'll see enough data that we can derive relatively high quality runtime statistics. AFAIK Ballista, Spark, others I don't remember off the top of my head have runtime adaptivity. They may also have a cost based optimizer though 😄. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
