xudong963 commented on PR #21815:
URL: https://github.com/apache/datafusion/pull/21815#issuecomment-4352740602
Thanks for the thoughtful response @asolimando — the framing is exactly
right, and the prior discussion with @kosiew in #21483 is helpful context.
**On scope**: agreed, let's land per-call caching in this PR (your Option 1)
and treat cross-call caching with stable node IDs as a follow-up. Could you
open an issue for Option 2 so we don't lose track?
**On the cache key**: (Arc::as_ptr, partition) is safe within a single
synchronous compute_statistics walk — the Arcs are held by the plan tree and
can't be dropped during the call, so pointer reuse isn't a concern. Good call.
**On benchmarks**: I'd avoid full TPC-DS Q99 — statistics computation is a
small fraction of total query time and will get lost in noise. A targeted
micro-bench is more informative:
- Build a deeply nested plan (e.g., a 10+ deep UnionExec chain, or a chain
of hash joins + repartitions) and time compute_statistics(plan, None)
before/after this PR.
- Optionally reuse a reproducer from #19795 (planning-speed EPIC) since
deep plans are exactly that issue's pain point.
That should cleanly demonstrate the gain.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]