GitHub user aokolnychyi opened a pull request:
https://github.com/apache/spark/pull/19252
[SPARK-21969][SQL] CommandUtils.updateTableStats should call refreshTable
## What changes were proposed in this pull request?
Tables in the catalog cache are not invalidated once their statistics are
updated. As a consequence, existing sessions will keep using the cached information
even though it is no longer valid. Consider the example below.
```
// step 1
spark.range(100).write.saveAsTable("tab1")
// step 2
spark.sql("analyze table tab1 compute statistics")
// step 3
spark.sql("explain cost select distinct * from tab1").show(false)
// step 4
spark.range(100).write.mode("append").saveAsTable("tab1")
// step 5
spark.sql("explain cost select distinct * from tab1").show(false)
```
After step 3, the table will be present in the catalog relation cache. Step
4 will correctly update the metadata inside the catalog but will NOT invalidate
the cache, so step 5 still reports the stale statistics.
By the way, running ``spark.sql("analyze table tab1 compute statistics")`` between
step 3 and step 4 would also solve the problem.
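For reference, the issue can already be worked around manually via the public Catalog API. The snippet below is only a sketch (not part of this patch, assuming a plain spark-shell session); it repeats the scenario but refreshes the cached relation explicitly before the final query:
```
// same scenario as above, but with an explicit refresh before the final query
spark.sql("drop table if exists tab1")
spark.range(100).write.saveAsTable("tab1")
spark.sql("analyze table tab1 compute statistics")
spark.sql("explain cost select distinct * from tab1").show(false)
spark.range(100).write.mode("append").saveAsTable("tab1")
// workaround: invalidate the cached relation so this session sees the new statistics
spark.catalog.refreshTable("tab1")
spark.sql("explain cost select distinct * from tab1").show(false)
```
The change proposed here is meant to make such an explicit refresh unnecessary: ``CommandUtils.updateTableStats`` calls ``refreshTable``, so the cached relation is invalidated whenever the statistics are updated.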
## How was this patch tested?
Existing and additional unit tests.
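As an illustration only (this is not the patch's actual test code, and it assumes a plain spark-shell session), the behaviour the new tests should cover looks roughly like this: after an append, the statistics visible to the same session should change.
```
// illustration only, not the actual test code in this patch
spark.sql("drop table if exists tab1")
spark.range(100).write.saveAsTable("tab1")
spark.sql("analyze table tab1 compute statistics")
val before = spark.sql("explain cost select distinct * from tab1").collect().head.getString(0)
spark.range(100).write.mode("append").saveAsTable("tab1")
val after = spark.sql("explain cost select distinct * from tab1").collect().head.getString(0)
// fails without the fix (the cached relation keeps the old stats), passes with it
assert(before != after, "statistics were not refreshed after the append")
```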
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/aokolnychyi/spark spark-21969
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19252.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19252
----
commit ba963b46cd2917315bc2bd0cf237c7d9f79e9d65
Author: aokolnychyi <[email protected]>
Date: 2017-09-16T11:57:52Z
[SPARK-21969][SQL] CommandUtils.updateTableStats should call refreshTable
----