liupengcheng created SPARK-30470:
------------------------------------

             Summary: Uncache table in tempViews if needed on session closed
                 Key: SPARK-30470
                 URL: https://issues.apache.org/jira/browse/SPARK-30470
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.2
            Reporter: liupengcheng


Currently, Spark does not clean up the cached tables behind temp views produced by SQL 
like the following:

`CACHE TABLE table1 AS SELECT ....`

There is a risk that `UNCACHE TABLE` is never called, because the session was closed 
unexpectedly or closed manually by the user. These temp views are then lost and cannot 
be accessed from any other session, but the cached plan still exists in the 
`CacheManager`.
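
For illustration, a minimal sketch of the leak (the `src` view and the local master are hypothetical, just to make it self-contained): once the first session is gone, no other session can see `table1`, yet its cached plan is still pinned in the shared `CacheManager`.

{code:scala}
import org.apache.spark.sql.SparkSession

object CacheLeakRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("cache-leak-repro")
      .getOrCreate()

    // Session 1: CACHE TABLE ... AS SELECT registers a temp view and
    // caches its plan in the CacheManager shared across sessions.
    val s1 = spark.newSession()
    s1.range(0, 1000).createOrReplaceTempView("src")
    s1.sql("CACHE TABLE table1 AS SELECT * FROM src")

    // Session 1 goes away (crash, or closed by the user) without ever
    // running `UNCACHE TABLE table1`.

    // Session 2: the temp view is invisible here, so the cached plan
    // can no longer be uncached by name ...
    val s2 = spark.newSession()
    assert(!s2.catalog.tableExists("table1"))
    // ... yet its in-memory blocks are still held by the shared
    // CacheManager (visible in the UI Storage tab) until clearCache().
  }
}
{code}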

Moreover, the leak may cause subsequent queries to fail. One failure we encountered in 
our production environment is shown below:
{code:java}
Caused by: java.io.FileNotFoundException: File does not exist: /user/xxxx/xx/data__db60e76d_91b8_42f3_909d_5c68692ecdd4
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:131)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage0.scan_nextBatch_0$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage0.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
{code}
The above exception happens when the user has updated the data of the table, but Spark 
still uses the stale cached plan.
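
What this issue proposes is, roughly, to uncache a session's cached temp views when the session is closed. A minimal sketch of such a cleanup follows; `SessionCacheCleanup` and `onSessionClosed` are hypothetical names, and the assumption is that the hosting layer (e.g. the Thrift server's session manager) invokes the hook on session close. The catalog calls themselves are the public API.

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical hook: drop the cached plans of this session's temp views
// before the session (and the only handle to those views) disappears.
object SessionCacheCleanup {
  def onSessionClosed(session: SparkSession): Unit = {
    session.catalog.listTables().collect()
      .filter(_.isTemporary)            // only this session's temp views
      .foreach { table =>
        if (session.catalog.isCached(table.name)) {
          // Frees the corresponding CacheManager entry and its blocks.
          session.catalog.uncacheTable(table.name)
        }
      }
  }
}
{code}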



