dtarima commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1976817040

   > What're your ideas?
   
   Looking at the code I think it's going to work and will fix the issue.
   
   I do have a couple of questions though:
   
   1. We'll have an asymmetry: `QueryExecution` instance for the unpersisted 
state is cached/reused in the `Dataset`, but for the persisted state it's 
always new (never cached/reused). Do we want to use a "cached" `QueryExecution` 
instance for performance reasons or it doesn't matter much? Regardless of the 
answer I think it makes sense to use the same approach for both `Dataset` 
states (persisted and unpersisted).
   
   If we want to reuse `QueryExecution` instances then we'll need to have both 
instances in the constructor:
   ```
   class Dataset[T] private[sql](
       @DeveloperApi @Unstable @transient val queryExecutionUnpersisted: 
QueryExecution,
       @DeveloperApi @Unstable @transient val queryExecutionPersisted: 
QueryExecution,
   ```
   
   If we don't need to reuse then we should probably have these values instead 
and always create new `QueryExecution` instances regardless of the `Dataset` 
persistence state:
   ```
   class Dataset[T] private[sql](
       val sparkSession: SparkSession,
       val logical: LogicalPlan,
       val tracker: QueryPlanningTracker = new QueryPlanningTracker,
       val mode: CommandExecutionMode.Value = CommandExecutionMode.ALL
   ```
   
   Even though the latter approach is simpler, I suspect that we want to reuse 
`QueryExecution` instances to avoid doing the same work over and over again 
(the plan analysis).
   
   
   2. Method `withAction` accepts `QueryExecution` so logically it's expected 
to be used, but it may change it based on the `Dataset` persistence state. Does 
the additional responsibility belong to `withAction` method? It seems implicit 
to me: nothing in the name or arguments hints at that. Is this the only place 
where we need different `QueryExecution` instances? The additional 
responsibility is independent so it should be in a separate method which 
provides proper `QueryExecution`: if we do that then we'll get something 
similar to `def queryExecution: QueryExecution` method above.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to