Thanks Andrew for bringing this PR forward. I would just like to give the big picture that led to this modification.
We would like to make Datafusion more efficiently integratable with table formats. I have recently written a design document in that sense [1] that goes through the various ways we could achieve that. The conclusion from that document is that the best moment to query the table format for the list of data files and statistics is at the moment of the conversion from logical plan to physical plan. This is why one of the first steps is to move the statistics out of the logical plan abstraction (because they are unknown at the moment it is built) and bring them into the physical plan. Making optimizations on the physical plan comes with its fair share of challenges. The physical plan is much more complex and can take many more shapes than the logical plan. This will definitely be a big challenge when adding new optimizations rules and maintaining existing ones. But it also unlocks a great potential, in particular in terms of adaptive query optimizations, that will become fairly easy to implement thanks to this change. Your feedback is very welcome, both on the PR and the design document. Remi [1] https://docs.google.com/document/d/1Bd4-PLLH-pHj0BquMDsJ6cVr_awnxTuvwNJuWsTHxAQ/edit?usp=sharing Le ven. 10 sept. 2021 à 20:45, Andrew Lamb <al...@influxdata.com> a écrit : > I would like to draw some attention to a PR [1] that proposes to move > statistics (and the cost based optimizations that rely on them) into the > physical planning realm. > > The feedback so far on this change seems positive but since it may have > non-trivial architectural impact on downstream projects that use > statistics, I wanted to give it some extra visibility. > > Thanks, > Andrew > > [1] https://github.com/apache/arrow-datafusion/pull/965 >