juliuszsompolski opened a new pull request, #48211:
URL: https://github.com/apache/spark/pull/48211

   ### What changes were proposed in this pull request?
   
   Currently, when the evaluation of a `lazy val` for one of the plans in 
QueryExecution fails, that `lazy val` remains uninitialized, and another 
attempt to initialize it is made the next time it is referenced. As a result, 
planning can be performed multiple times, which is inefficient and can 
duplicate side effects.
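
   For illustration, a minimal plain-Scala sketch (not Spark code) of the `lazy val` semantics this relies on: if the initializer throws, the `lazy val` stays uninitialized, so the next access runs the initializer again, repeating its side effects.
   ```
   object LazyRetryDemo extends App {
     var attempts = 0
     var shouldFail = true

     lazy val plan: String = {
       attempts += 1  // side effect repeated on every retry
       if (shouldFail) throw new IllegalStateException("planning failed")
       "optimized plan"
     }

     try plan catch { case _: IllegalStateException => () }  // first access fails, lazy val stays uninitialized
     shouldFail = false
     println(plan)      // second access re-runs the initializer and succeeds
     println(attempts)  // prints 2: the initializer ran twice
   }
   ```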
   
   ### Why are the changes needed?
   
   The current behaviour leads to inefficiencies and subtle problems that can 
be triggered accidentally, for example when plans are accessed for logging 
purposes.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes.
   This change would bring slight behaviour changes:
   
   Examples:
   ```
   val df = a.join(b)
   spark.conf.set("spark.sql.crossJoin.enabled", "false")
   try { df.collect() } catch { case _: Throwable => }
   spark.conf.set("spark.sql.crossJoin.enabled", "true")
   df.collect()
   ```
   This will succeed: the first collect() fails because of the cartesian 
product, so the plan is not initialized, and the second collect() tries to 
initialize it again and picks up the new config.
   ```
   val df = a.join(b)
   spark.conf.set("spark.sql.crossJoin.enabled", "true")
   df.collect()
   spark.conf.set("spark.sql.crossJoin.enabled", "false")
   df.collect()
   ```
   This will also succeed, because the second collect() will reuse the plan 
initialized by the first one and ignore the config change.
   
   The current semantics are: "If plan evaluation fails, try again the next 
time it's accessed. If plan evaluation ever succeeded, keep that plan." In 
other words, successful plan evaluation happens "at most once".
   
   Spark 4.0 may be a good candidate for a slight change here, to make sure 
that we don't re-execute the optimizer and its potential side effects.
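
   For illustration, a minimal sketch of the "exactly once" alternative: wrapping the result in a `Try` so the first outcome, success or failure, is cached and rethrown on later accesses. `CachedPlan` and its members are hypothetical names used only to make the semantics concrete, not the implementation in this PR.
   ```
   import scala.util.Try

   // Hypothetical sketch: cache the first planning outcome, success or failure.
   class CachedPlan[T](compute: => T) {
     // Try never throws, so this lazy val always initializes on first access.
     private lazy val result: Try[T] = Try(compute)
     // Returns the cached value, or rethrows the cached failure on every access.
     def get: T = result.get
   }
   ```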
   
   Note: These behaviour changes have already happened in Spark Connect mode, 
where the Dataset object is not reused across executions. This change makes 
Spark Classic and Spark Connect behave the same again.
   
   ### How was this patch tested?
   
   Existing tests show no issues, except for the tests that exhibit the 
behaviour change described above.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Trivial code completion suggestions.
   Generated-by: Github Copilot

