[
https://issues.apache.org/jira/browse/HIVE-21037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ashutosh Bapat reassigned HIVE-21037:
-------------------------------------
> Replicate column statistics for Hive tables
> -------------------------------------------
>
> Key: HIVE-21037
> URL: https://issues.apache.org/jira/browse/HIVE-21037
> Project: Hive
> Issue Type: Improvement
> Components: HiveServer2
> Reporter: Ashutosh Bapat
> Assignee: Ashutosh Bapat
> Priority: Major
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> Statistics are important for query optimization, so keeping them
> up to date on the replica matters for query performance. Statistics
> are collected by scanning a table in its entirety. When data is
> replicated, we could therefore either (a) recompute the statistics by
> scanning the data on the replica, or (b) replicate the statistics along
> with the data. We prefer the second approach for the following reasons.
> # Scanning the data on replica isn’t a good option since it wastes CPU
> cycles and puts load during replication, which can be significant.
> # Storage systems like S3 may not have compute capabilities, so when we are
> replicating from on-prem to cloud, we cannot rely on the target to gather
> statistics.
> # For ACID tables, the statistics should be associated with the snapshot.
> This means the statistics collection on the target should sync with the
> write-id on the source, since the target doesn't generate write-ids of its own.
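
The ACID point above can be illustrated with a small sketch. This is not Hive code: the class and field names (ColumnStats, TableStatsSnapshot, ReplicaTable, apply_data, apply_stats) are hypothetical, and the model assumes the replica simply adopts the source's write-id and accepts replicated statistics only when they describe the snapshot it currently holds.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical per-column statistics; field names are illustrative."""
    num_nulls: int
    num_distinct: int


@dataclass(frozen=True)
class TableStatsSnapshot:
    """Statistics captured on the source, tagged with the write-id they describe."""
    write_id: int
    columns: Dict[str, ColumnStats]


class ReplicaTable:
    """Toy stand-in for a table on the replication target (assumption, not Hive's model)."""

    def __init__(self) -> None:
        self.write_id = 0  # last write-id replayed from the source
        self.stats: Optional[TableStatsSnapshot] = None

    def apply_data(self, source_write_id: int) -> None:
        # The target does not allocate write-ids; it adopts the source's.
        self.write_id = source_write_id
        # Statistics for an older snapshot no longer describe the data.
        if self.stats is not None and self.stats.write_id != source_write_id:
            self.stats = None

    def apply_stats(self, snapshot: TableStatsSnapshot) -> bool:
        # Accept statistics only if they match the snapshot the replica holds.
        if snapshot.write_id != self.write_id:
            return False
        self.stats = snapshot
        return True
```

In this sketch no scan happens on the replica: statistics arrive ready-made from the source, and the write-id check is what ties them to a snapshot.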