adriangb commented on PR #22343:
URL: https://github.com/apache/datafusion/pull/22343#issuecomment-4594117921

   The cost based optimizer discussion is interesting, but I worry that for 
analytical systems runtime adaptivity is mort impactful. Cost based optimizers 
are only as good as your estimates, which are going to be hard to make good if 
we don't have persistent stats, users write custom UDFs and increasingly AI is 
writing crazy complex queries.
   
   Cost based optimizers are essential for transactional systems where the 
average rows per query may be 1, but for the analytical workloads that people 
tend to run with DataFusion the average rows processed is in the tens if not 
hundreds of thousands (otherwise batch sizes of 8k and row groups of 1M would 
make no sense). This gives us an opportunity: if a 1s query is acceptable then 
running for 100ms / a small fraction of the query in a sub-optimal way while we 
gather stats but then making the other 900ms take 500ms is a big win. We'll see 
enough data that we can derive relatively high quality runtime statistics. 
AFAIK Ballista, Spark, others I don't remember off the top of my head have 
runtime adaptivity. They may also have a cost based optimizer though 😄.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to