[PR] [SPARK-54812][SQL][4.1] Make executable commands not execute on resultDf.cache() [spark]

via GitHub Thu, 29 Jan 2026 14:52:26 -0800


szehon-ho opened a new pull request, #54064:
URL: https://github.com/apache/spark/pull/54064


   
   
   ### What changes were proposed in this pull request?
   Backport of #53572 to 4.1 branch
   
   Follow-up to https://github.com/apache/spark/pull/51032. That PR changed 
V2WriteCommand to not execute eagerly on df.cache(). However, many other 
commands still do, for example:
   
   ```
   val df = sql("CREATE TABLE...")
   df.cache()  // executes again, fails with TableAlreadyExistsException
   ```
   
   This patch skips CacheManager for all Commands, because they are already 
executed eagerly when sql("COMMAND") is first called.
   
   Another example of the behavior before this patch:
   
   ```
   val df = sql("SHOW TABLES")
   sql("CREATE TABLE foo")
   df.cache()  // executes again and df now includes foo
   ```
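   
   The difference between the old and new behavior can be modeled with a small, 
self-contained sketch. It does not use Spark: `Catalog`, `Plan`, `Command`, 
`ResultDf`, and the `skipCommandsOnCache` flag are hypothetical stand-ins 
invented for illustration, not the classes or the code path touched by this 
patch.
   
   ```scala
   import scala.collection.mutable

   object CacheSkipSketch {
     // Toy catalog standing in for a real metastore (hypothetical).
     final class Catalog {
       private val tables = mutable.LinkedHashSet.empty[String]
       def create(name: String): Unit =
         if (!tables.add(name))
           throw new IllegalStateException(s"Table already exists: $name")
       def list: Seq[String] = tables.toList
     }

     sealed trait Plan { def execute(catalog: Catalog): Seq[String] }
     // Marker for plans that run eagerly when the result is first created.
     sealed trait Command extends Plan

     final case class CreateTable(name: String) extends Command {
       def execute(catalog: Catalog): Seq[String] = { catalog.create(name); Nil }
     }
     case object ShowTables extends Command {
       def execute(catalog: Catalog): Seq[String] = catalog.list
     }

     // Toy result: the command runs once at construction and its rows are kept.
     final class ResultDf(plan: Plan, catalog: Catalog, skipCommandsOnCache: Boolean) {
       private var rows: Seq[String] = plan.execute(catalog)

       def cache(): ResultDf = {
         plan match {
           case _: Command if skipCommandsOnCache => () // modeled new behavior: no-op
           case _ => rows = plan.execute(catalog)       // modeled old behavior: re-run
         }
         this
       }
       def collect(): Seq[String] = rows
     }
   }
   ```
   
   In this model, with `skipCommandsOnCache = false`, calling `cache()` on a 
`CreateTable` result throws because the table already exists, and calling it on 
a `ShowTables` result picks up tables created in the meantime. With the flag 
set to `true`, both calls leave the result untouched.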
   
   ### Why are the changes needed?
   To prevent a command with side effects from being executed again when a 
user runs df.cache() on the result of that command. Many such re-executions 
are dangerous because the command runs a second time without the user 
expecting it (df.cache() triggers another action on the table).
   
   ### Does this PR introduce _any_ user-facing change?
   If the user created a resultDf from a command and then ran 
resultDf.cache(), the command used to be re-run. Now the call is a no-op. Most 
of the time this is beneficial, because re-running the command results in an 
error or, worse, data corruption. However, in a few cases, such as SHOW TABLES 
or SHOW NAMESPACES, it changes the contents of resultDf, which is no longer 
refreshed when resultDf.cache() is called.
   
   
   Note: In most cases there is no user-facing change. This is because many 
command plan nodes, for example DescribeTableExec, already hold an in-memory 
reference to the Table object and keep returning the old result despite 
repeated execution. However, SHOW XXX command plans do not cache their results 
in memory, so they are affected.
   
   
   ### How was this patch tested?
   Existing unit tests, plus new unit tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

