dtarima commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1980642854

   > > I don't think it fixes the issue completely and there are some problems with the solution. I believe a proper solution is in the following comment: [#45181 (comment)](https://github.com/apache/spark/pull/45181#issuecomment-1969241145)
   > > Also many more tests are needed.
   > 
   > Based on this solution, I think we should define `queryExecution` like this:
   > 
   > ```scala
   > private var queryPersisted: Option[(LogicalPlan, QueryExecution)] = None
   > 
   > def queryExecution: QueryExecution = {
   >   val cacheManager = queryExec.sparkSession.sharedState.cacheManager
   >   val plan = queryExec.logical
   >   plan.find {
   >     case _: IgnoreCachedData => false
   >     case currentFragment => cacheManager.lookupCachedData(currentFragment).isDefined
   >   } match {
   >     // if we can't find a cached plan, we directly return queryExec
   >     case None =>
   >       queryPersisted = None
   >       queryExec
   >     // a child plan is cached, so we make sure that queryPersisted
   >     // is consistent with the cached plan
   >     case Some(cachedPlan) =>
   >       queryPersisted match {
   >         case None =>
   >           val qe = new QueryExecution(queryExec.sparkSession, plan)
   >           queryPersisted = Some((cachedPlan, qe))
   >           qe
   >         case Some((prevCachedPlan, prevCachedQe)) =>
   >           if (prevCachedPlan.sameResult(cachedPlan)) {
   >             prevCachedQe
   >           } else {
   >             val qe = new QueryExecution(queryExec.sparkSession, plan)
   >             // refresh the cached queryPersisted
   >             queryPersisted = Some((cachedPlan, qe))
   >             qe
   >           }
   >       }
   >   }
   > }
   > ```
   > 
   > Whenever the dataset or a child dataset is persisted or unpersisted, we get the newest consistent result.
   
   You're correct that we have to consider the current persistence states of the children nodes :+1: 
   If persistence states are changed concurrently, then inconsistencies are inevitable.
   `QueryExecution` must be recalculated each time anything changes underneath. It seems like it should be done in `QueryExecution` itself, not in `Dataset` (once `withCachedData` changes, then `withCachedData`, `optimizedPlan`, `sparkPlan`, and `executedPlan` need to be recalculated).
   I think we should start with comprehensive tests and then implement a suitable fix.
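   To make that recalculation idea concrete, here is a minimal, self-contained sketch (illustrative names only, not Spark's actual API): the derived plan is memoized together with the cache state it was computed against, and it is recomputed lazily whenever that state changes.

   ```scala
   // Hypothetical sketch, not Spark's real classes: `CacheState` and
   // `DerivedPlans` stand in for the cache manager's state and for
   // QueryExecution's derived plans (withCachedData, optimizedPlan, ...).
   object RecomputeSketch {
     final case class CacheState(version: Long)

     // Memoizes the derived "plan" together with the cache state it was
     // computed against; recomputes when that state has changed.
     final class DerivedPlans(source: String, cacheState: () => CacheState) {
       private var memo: Option[(CacheState, String)] = None

       def executedPlan: String = {
         val current = cacheState()
         memo match {
           case Some((seen, plan)) if seen == current => plan // still valid
           case _ =>
             val plan = s"$source@v${current.version}" // stand-in for replanning
             memo = Some((current, plan))
             plan
         }
       }
     }
   }
   ```

   With this shape, persisting or unpersisting a child only has to change the observed cache state; every derived plan is rebuilt on the next access instead of being patched from `Dataset`.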


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

