asolimando commented on PR #21815:
URL: https://github.com/apache/datafusion/pull/21815#issuecomment-4351618242

   > @asolimando thanks! I'm sorry that I'm busy with others this week.
   > 
   > This PR doesn't fully solve the problem it claims to. The stated goal in 
the PR description and #20184 is to eliminate exponential recomputation. But 
for any plan containing a `CoalescePartitionsExec`, `SortPreservingMergeExec`, 
`RepartitionExec`, `HashJoinExec` (CollectLeft/Auto), `CrossJoinExec`, or 
`NestedLoopJoinExec` — which is most non-trivial plans — the operator restarts 
a _fresh_ bottom-up walk from inside its own `partition_statistics` IIUC. So 
the recomputation isn't gone;
   > 
   > Caching sounds good, how about making caching part of `StatisticsContext` 
from day one, then we can have some benchmarks to show off the gains which will 
be easier for the community to accept the PR, wdyt?
   
   Thank you for your input @xudong963, no need to apologies, it's 
understandable!
   
   You raise a fair point, we fully avoid the recomputation only for linear 
plans, but operators that call `compute_statistics(child, None)` internally 
don't benefit. This is noted in the "What remains for follow-up" section but I 
agree it might not be enough for the first iteration, and I anyway should have 
marked "partially closes https://github.com/apache/datafusion/issues/20184";.
   
   Re. the cache, I identified the need for the `StatisticsRegistry` already, 
and we discussed with @kosiew in the related PR (#21483, 
[comment](https://github.com/apache/datafusion/pull/21483/files#r2033980925), 
branch `asolimando/statistics-planner-with-statscache-v2`). We agreed to defer 
it to limit scope, but this is the right place to discuss it.
   
   One limitation I identified on the `StatsCache` (as I called it there), is 
around the cache key, which should "identify" an `ExecutionPlan`, which doesn't 
have any stable id other than its memory pointer ( so the cache key is 
effectively `(Arc::as_ptr, partition)`), but I am concerned of nodes being 
disposed (and re-used).
   
   Cache lifecycle/scope:
   1. single invocation of `compute_statistics` (as described in  
https://github.com/apache/datafusion/issues/20184): if we agree on this, then 
the concern is not valid, as the plan tree is "stable" during the lifetime. 
When e.g. `CoalescePartitionsExec` calls `compute_statistics(child, None)` 
internally, the cache already has the subtree results, fully eliminating 
redundant walks.
   
   2. multiple invocations of `compute_statistics` (same rule or cross-rules): 
here we necessarily need a stable node ID and we can't rely on the pointer, 
since nodes can be dropped/recreated
   
   The scope of #20184 is, in my understanding, 1. (single walk), if you agree 
with that, I plan to use `(Arc::as_ptr, partition)` as cache key, and 
introducing node IDs and expanding the cache lifetime IMO be tackled as a 
followup (I can create issues for that, if the direction is confirmed), as with 
this solution we should already see computational benefits.
   
   Re. benchmarks, do you have a specific workload in mind (e.g., TPC-DS, Q99)? 
Also, could I be added to the allowlist to trigger benchmark runs so I can 
iterate without requiring manual re-runs, in case I need multiple iterations?
   
   WDYT?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to