[ 
https://issues.apache.org/jira/browse/SPARK-21031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-21031:
---------------------------------
    Summary: Add `alterTableStats` to store spark's stats and let `alterTable` 
keep existing stats  (was: Clearly separate hive stats and spark stats in 
catalog)

> Add `alterTableStats` to store spark's stats and let `alterTable` keep 
> existing stats
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-21031
>                 URL: https://issues.apache.org/jira/browse/SPARK-21031
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Zhenhua Wang
>
> Currently, hive's stats are read into `CatalogStatistics`, and spark's 
> stats are also persisted through `CatalogStatistics`. As a result, hive's 
> stats can be unexpectedly propagated into spark's stats.
> For example, for a catalog table, we read stats from hive, e.g. "totalSize", 
> and put them into `CatalogStatistics`. Then, when an "ALTER TABLE" command 
> runs, we store the stats in `CatalogStatistics` into the metastore as spark's 
> stats (because we don't know whether they came from spark or not). But spark's 
> stats should only be generated by the "ANALYZE" command, so writing them is an 
> unexpected side effect of "ALTER TABLE".
> Secondly, once spark's stats are in the metastore, after inserting new 
> data, although hive updates "totalSize" in the metastore, we still cannot get 
> the right `sizeInBytes` in `CatalogStatistics`, because we prefer spark's 
> stats (which should not exist here) over hive's stats.
> {code}
> spark-sql> create table xx(i string, j string);
> spark-sql> insert into table xx select 'a', 'b';
> spark-sql> desc formatted xx;
> # col_name    data_type       comment
> i     string  NULL
> j     string  NULL
> # Detailed Table Information          
> Database      default 
> Table xx      
> Owner wzh     
> Created       Thu Jun 08 18:30:46 PDT 2017    
> Last Access   Wed Dec 31 16:00:00 PST 1969    
> Type  MANAGED 
> Provider      hive    
> Properties    [serialization.format=1]        
> Statistics    4 bytes 
> Location      file:/Users/wzh/Projects/spark/spark-warehouse/xx       
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe      
> InputFormat   org.apache.hadoop.mapred.TextInputFormat        
> OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      
> Partition Provider    Catalog 
> Time taken: 0.089 seconds, Fetched 19 row(s)
> spark-sql> alter table xx set tblproperties ('prop1' = 'yy');
> Time taken: 0.187 seconds
> spark-sql> insert into table xx select 'c', 'd';
> Time taken: 0.583 seconds
> spark-sql> desc formatted xx;
> # col_name    data_type       comment
> i     string  NULL
> j     string  NULL
> # Detailed Table Information          
> Database      default 
> Table xx      
> Owner wzh     
> Created       Thu Jun 08 18:30:46 PDT 2017    
> Last Access   Wed Dec 31 16:00:00 PST 1969    
> Type  MANAGED 
> Provider      hive    
> Properties    [serialization.format=1]        
> Statistics    4 bytes (-- This should be 8 bytes)
> Location      file:/Users/wzh/Projects/spark/spark-warehouse/xx       
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe      
> InputFormat   org.apache.hadoop.mapred.TextInputFormat        
> OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      
> Partition Provider    Catalog 
> Time taken: 0.077 seconds, Fetched 19 row(s)
> {code}
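
The transcript above can be replayed with a small toy model, which may make the precedence problem easier to see. This is only a sketch, not Spark's implementation: `ToyMetastore` and all of its methods are hypothetical names, the property key used for spark's stats is an assumption, and the 4-byte insert size simply mirrors the transcript.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

HIVE_TOTAL_SIZE = "totalSize"  # property hive maintains on insert
SPARK_TOTAL_SIZE = "spark.sql.statistics.totalSize"  # assumed key for spark's stats


@dataclass
class ToyMetastore:
    """Hypothetical stand-in for the metastore: one table's properties."""

    props: Dict[str, str] = field(default_factory=dict)

    def insert(self, num_bytes: int) -> None:
        # hive bumps its own totalSize whenever data is written.
        current = int(self.props.get(HIVE_TOTAL_SIZE, "0"))
        self.props[HIVE_TOTAL_SIZE] = str(current + num_bytes)

    def read_size_in_bytes(self) -> Optional[int]:
        # spark's stats take precedence over hive's when both exist.
        for key in (SPARK_TOTAL_SIZE, HIVE_TOTAL_SIZE):
            if key in self.props:
                return int(self.props[key])
        return None

    def alter_table_old(self, new_props: Dict[str, str]) -> None:
        # Behaviour described in the issue: the size that was read into the
        # catalog stats is written back as if it were spark's stats.
        size = self.read_size_in_bytes()
        self.props.update(new_props)
        if size is not None:
            self.props[SPARK_TOTAL_SIZE] = str(size)

    def alter_table(self, new_props: Dict[str, str]) -> None:
        # Proposed behaviour: change other properties, keep existing stats.
        self.props.update(new_props)

    def alter_table_stats(self, size_in_bytes: Optional[int]) -> None:
        # Proposed behaviour: the only path that writes or clears spark's stats.
        if size_in_bytes is None:
            self.props.pop(SPARK_TOTAL_SIZE, None)
        else:
            self.props[SPARK_TOTAL_SIZE] = str(size_in_bytes)


def replay(use_old_alter: bool) -> Optional[int]:
    """Replay the transcript: insert, ALTER TABLE, insert, then read the size."""
    ms = ToyMetastore()
    ms.insert(4)
    if use_old_alter:
        ms.alter_table_old({"prop1": "yy"})
    else:
        ms.alter_table({"prop1": "yy"})
    ms.insert(4)
    return ms.read_size_in_bytes()


if __name__ == "__main__":
    print(replay(use_old_alter=True))   # 4: stale size copied by ALTER TABLE wins
    print(replay(use_old_alter=False))  # 8: hive's updated totalSize is visible
```

In this model, keeping the stats write in a separate `alter_table_stats` call means an unrelated "ALTER TABLE" can no longer turn hive's size into spark's stats, which is the separation the new summary asks for.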


