GitHub user wzhfy opened a pull request:

    https://github.com/apache/spark/pull/18248

    Separation between spark's stats and hive's stats

    ## What changes were proposed in this pull request?
    
    Currently, Hive's stats are read into `CatalogStatistics`, and Spark's 
stats are also persisted through `CatalogStatistics`. Therefore, given a 
`CatalogStatistics`, we cannot tell whether its stats come from Hive or from 
Spark. As a result, Hive's stats can be unexpectedly propagated into Spark's 
stats.
    
    For example, when running an "ALTER TABLE" command, we store the stats 
held in `CatalogStatistics` (which were read from Hive, e.g. "totalSize") into 
the metastore as Spark's stats, because we don't know whether they came from 
Spark or not. But Spark's stats should only be generated by the "ANALYZE" 
command, so this is an unexpected side effect of "ALTER TABLE".
    
    Besides, once the wrong Spark stats are stored, later inserts make things 
worse: although Hive updates "totalSize" in the metastore after new data is 
inserted, we still cannot get the right `sizeInBytes` in `CatalogStatistics`, 
because we respect the wrong Spark stats over Hive's stats.
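    
    The sequence above can be reproduced with a toy model in which the 
metastore is a plain dict of table properties. This is only a sketch: the 
property key used for Spark's size and the helper names `read_stats_mixed` and 
`alter_table_mixed` are illustrative assumptions, not Spark's actual code.

```python
# Toy model: the metastore is a plain dict of table properties.
# Key and function names are illustrative assumptions, not Spark's API.
HIVE_TOTAL_SIZE = "totalSize"
SPARK_TOTAL_SIZE = "spark.sql.statistics.totalSize"


def read_stats_mixed(props):
    """Read sizeInBytes, preferring Spark's value and falling back to Hive's."""
    if SPARK_TOTAL_SIZE in props:
        return int(props[SPARK_TOTAL_SIZE])
    if HIVE_TOTAL_SIZE in props:
        return int(props[HIVE_TOTAL_SIZE])
    return None


def alter_table_mixed(props):
    """Mimic ALTER TABLE writing back whatever it read as Spark's stats."""
    updated = dict(props)
    size = read_stats_mixed(props)
    if size is not None:
        # The origin of `size` is unknown here, so Hive's value gets copied.
        updated[SPARK_TOTAL_SIZE] = str(size)
    return updated


props = {HIVE_TOTAL_SIZE: "100"}      # Hive recorded 100 bytes
props = alter_table_mixed(props)      # Hive's value is copied into Spark's key
props[HIVE_TOTAL_SIZE] = "250"        # an insert makes Hive update totalSize
stale_size = read_stats_mixed(props)  # the copied value still takes precedence
```

    In this model `stale_size` stays at the pre-insert value, because the copy 
made by the "ALTER TABLE" step shadows Hive's updated "totalSize".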
    
    To fix this, we need to clearly separate Spark's stats from Hive's stats in 
`CatalogStatistics`.
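    
    One possible shape of such a separation, as a minimal sketch: keep the two 
origins in distinct fields and let "ALTER TABLE" persist only the Spark one. 
The `SeparatedStats` class, its field names, the helper functions, and the 
property key for Spark's size are all hypothetical and are not taken from this 
patch.

```python
from dataclasses import dataclass
from typing import Optional

# Key names are illustrative assumptions for this toy model.
HIVE_TOTAL_SIZE = "totalSize"
SPARK_TOTAL_SIZE = "spark.sql.statistics.totalSize"


@dataclass(frozen=True)
class SeparatedStats:
    """Hypothetical stand-in for CatalogStatistics that records the origin."""
    spark_size: Optional[int] = None  # meant to be written only by ANALYZE
    hive_size: Optional[int] = None   # whatever Hive currently reports

    @property
    def size_in_bytes(self) -> Optional[int]:
        # Spark's stats still win, but only if Spark really produced them.
        return self.spark_size if self.spark_size is not None else self.hive_size


def read_stats(props):
    """Load both origins from the table properties without merging them."""
    spark = props.get(SPARK_TOTAL_SIZE)
    hive = props.get(HIVE_TOTAL_SIZE)
    return SeparatedStats(
        spark_size=int(spark) if spark is not None else None,
        hive_size=int(hive) if hive is not None else None,
    )


def alter_table(props):
    """Mimic ALTER TABLE: persist Spark's stats only, never Hive's."""
    stats = read_stats(props)
    updated = dict(props)
    if stats.spark_size is not None:
        updated[SPARK_TOTAL_SIZE] = str(stats.spark_size)
    return updated


props = alter_table({HIVE_TOTAL_SIZE: "100"})  # nothing is copied to Spark's key
props[HIVE_TOTAL_SIZE] = "250"                 # an insert makes Hive update totalSize
fresh_size = read_stats(props).size_in_bytes   # falls back to Hive's current value
```

    With the origins kept apart, the "ALTER TABLE" step in this model leaves 
Spark's key unset, so the later read picks up Hive's updated "totalSize".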
    
    ## How was this patch tested?
    
    Modified existing tests and added a new test.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wzhfy/spark separateHiveStats

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18248.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18248
    
----
commit 0d56f165e4570f1b7a9908f4bceae55407e5cb03
Author: Zhenhua Wang <[email protected]>
Date:   2017-06-09T01:47:26Z

    separation between spark's stats and hive's stats

----

