dtarima commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1980642854
> > I don't think it fixes the issue completely and there are some problems with the solution. I believe a proper solution is in the following comment: [#45181 (comment)](https://github.com/apache/spark/pull/45181#issuecomment-1969241145) > > Also many more tests are needed. > > Based on this solution, I think we should def `queryExecution` like this: > > ```scala > private var queryPersisted: Option[(LogicalPlan, QueryExecution)] = None > > def queryExecution: QueryExecution = { > val cacheManager = queryExec.sparkSession.sharedState.cacheManager > val plan = queryExec.logical > plan.find({ > case _: IgnoreCachedData => false > case currentFragment => cacheManager.lookupCachedData(currentFragment).isDefined > }) match { > // if we can't find cached plan, we directly return the queryExec > case None => > queryPersisted = None > queryExec > // we find children plan is cached, we make sure that queryPersisted is consistent > // with the cachedPlan > case Some(cachedPlan) => > queryPersisted match { > case None => > val qe = new QueryExecution(queryExec) > queryPersisted = Some(cachedPlan, qe) > qe > case Some((prevCachedPlan, prevCachedQe)) => > if (prevCachedPlan.sameResult(cachedPlan)) { > prevCachedQe > } else { > val qe = new QueryExecution(queryExec) > // refresh the cached queryPersisted > queryPersisted = Some(cachedPlan, qe) > qe > } > } > } > } > ``` > > Whenever the ds or the child dataset is persisted or unpersisted, we can get newest consistent result. You're correct that we have to consider the current persistence states of the children nodes :+1: If persistence states are changed concurrently then inconsistencies are inevitable. `QueryExecution` must be recalculated each time anything changes underneath. It seems like it should be done in `QueryExecution` itself, not in `Dataset` (once `withCachedData` changes then `withCachedData`, `optimizedPlan`, `sparkPlan`, `executedPlan` need to be recalculated). I think we should start with comprehensive tests and then implement a suitable fix. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
