asolimando commented on issue #23194: URL: https://github.com/apache/datafusion/issues/23194#issuecomment-4807999502
Similar work for distributed-datafusion from @gabotechs: https://github.com/datafusion-contrib/datafusion-distributed/pull/486

I only looked at the issue/PR description for this issue, but I can already see many commonalities:

- pipeline breakers/boundaries as a good place to accumulate runtime statistics
- re-use of the built-in statistics propagation mechanism, only fueled with runtime statistics
- runtime statistics must be sampled, as we are in a streaming computational model (the idea behind `SamplerExec` in the above PR, and the similar buffer node here)

I wonder how much can be re-used across core DF/distributed DF/Ballista. There are different challenges, and the same logical concept takes different forms in the three cases, but the mechanism seems very similar, if not identical. WDYT?
