[ https://issues.apache.org/jira/browse/HIVE-21037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Bapat reassigned HIVE-21037:
-------------------------------------


> Replicate column statistics for Hive tables
> -------------------------------------------
>
>                 Key: HIVE-21037
>                 URL: https://issues.apache.org/jira/browse/HIVE-21037
>             Project: Hive
>          Issue Type: Improvement
>          Components: HiveServer2
>            Reporter: Ashutosh Bapat
>            Assignee: Ashutosh Bapat
>            Priority: Major
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Statistics are important for query optimization, so keeping them
> up-to-date on the replica matters for query performance. Statistics are
> collected by scanning a table in its entirety. Thus, when data is
> replicated, we could either (a) recompute the statistics by scanning the
> data on the replica or (b) replicate the statistics along with the data.
> We prefer the second approach for the following reasons.
>  # Scanning the data on the replica isn’t a good option, since it wastes
> CPU cycles and adds load during replication, which can be significant.
>  # Storage such as S3 may not have compute capabilities, so when we are
> replicating from on-prem to cloud, we cannot rely on the target to gather
> statistics.
>  # For ACID tables, statistics should be associated with a snapshot.
> This means statistics collected on the target would have to be kept in sync
> with the write-id on the source, since the target doesn't generate write
> ids of its own.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
