[
https://issues.apache.org/jira/browse/SPARK-21031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zhenhua Wang updated SPARK-21031:
---------------------------------
Summary: Add `alterTableStats` to store spark's stats and let `alterTable`
keep existing stats (was: Clearly separate hive stats and spark stats in
catalog)
> Add `alterTableStats` to store spark's stats and let `alterTable` keep
> existing stats
> -------------------------------------------------------------------------------------
>
> Key: SPARK-21031
> URL: https://issues.apache.org/jira/browse/SPARK-21031
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Zhenhua Wang
>
> Currently, hive's stats are read into `CatalogStatistics`, while spark's
> stats are also persisted through `CatalogStatistics`. As a result, hive's
> stats can be unexpectedly propagated into spark's stats.
> For example, for a catalog table, we read stats from hive, e.g. "totalSize",
> and put them into `CatalogStatistics`. Then, when an "ALTER TABLE" command
> runs, we store the stats in `CatalogStatistics` into the metastore as spark's
> stats (because we don't know whether they came from spark or not). But
> spark's stats should only be generated by the "ANALYZE" command, so this is
> an unexpected side effect of "ALTER TABLE".
> Secondly, now that spark's stats are in the metastore, after inserting new
> data we still cannot get the right `sizeInBytes` in `CatalogStatistics`, even
> though hive has updated "totalSize" in the metastore, because we prefer
> spark's stats (which should not exist in the first place) over hive's stats.
> {code}
> spark-sql> create table xx(i string, j string);
> spark-sql> insert into table xx select 'a', 'b';
> spark-sql> desc formatted xx;
> # col_name data_type comment
> i string NULL
> j string NULL
> # Detailed Table Information
> Database default
> Table xx
> Owner wzh
> Created Thu Jun 08 18:30:46 PDT 2017
> Last Access Wed Dec 31 16:00:00 PST 1969
> Type MANAGED
> Provider hive
> Properties [serialization.format=1]
> Statistics 4 bytes
> Location file:/Users/wzh/Projects/spark/spark-warehouse/xx
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat org.apache.hadoop.mapred.TextInputFormat
> OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Partition Provider Catalog
> Time taken: 0.089 seconds, Fetched 19 row(s)
> spark-sql> alter table xx set tblproperties ('prop1' = 'yy');
> Time taken: 0.187 seconds
> spark-sql> insert into table xx select 'c', 'd';
> Time taken: 0.583 seconds
> spark-sql> desc formatted xx;
> # col_name data_type comment
> i string NULL
> j string NULL
> # Detailed Table Information
> Database default
> Table xx
> Owner wzh
> Created Thu Jun 08 18:30:46 PDT 2017
> Last Access Wed Dec 31 16:00:00 PST 1969
> Type MANAGED
> Provider hive
> Properties [serialization.format=1]
> Statistics 4 bytes (-- This should be 8 bytes)
> Location file:/Users/wzh/Projects/spark/spark-warehouse/xx
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat org.apache.hadoop.mapred.TextInputFormat
> OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Partition Provider Catalog
> Time taken: 0.077 seconds, Fetched 19 row(s)
> {code}
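
The sequence in the description can be modeled with a small toy sketch. This is plain Python, not Spark code: the metastore is just a dict, and the property keys and every function name here are made up for illustration. The "keep stats" variant only mirrors what the issue summary proposes; it is not the actual patch.

```python
# Toy model of the stats mix-up described in SPARK-21031; not Spark code.
# Property keys and function names are illustrative assumptions.
HIVE_SIZE = "totalSize"
SPARK_SIZE = "spark.sql.statistics.totalSize"


def read_size(props):
    """Spark-side read: prefer spark's stats, fall back to hive's."""
    if SPARK_SIZE in props:
        return props[SPARK_SIZE]
    return props.get(HIVE_SIZE)


def insert(props, nbytes):
    """Hive-side bookkeeping on insert: only totalSize is updated."""
    props[HIVE_SIZE] = props.get(HIVE_SIZE, 0) + nbytes


def alter_table_buggy(props, new_props):
    """Reported behavior: whatever size was read is written back as spark's stats."""
    size = read_size(props)
    props.update(new_props)
    if size is not None:
        props[SPARK_SIZE] = size


def alter_table_keep_stats(props, new_props):
    """Proposed behavior: altering the table leaves stored stats untouched."""
    props.update(new_props)


def alter_table_stats(props, size):
    """Proposed separate entry point for storing spark's stats (ANALYZE only)."""
    props[SPARK_SIZE] = size


def scenario(alter):
    """Replay the session: insert, alter table properties, insert, read size."""
    props = {}
    insert(props, 4)
    alter(props, {"prop1": "yy"})
    insert(props, 4)
    return read_size(props)


buggy_size = scenario(alter_table_buggy)
fixed_size = scenario(alter_table_keep_stats)
print(buggy_size, fixed_size)
```

In the buggy variant the size copied by the alter step shadows hive's later "totalSize" update, which matches the stale "4 bytes" in the second `desc formatted` output above.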