DanielZhu58 opened a new pull request, #6264:
URL: https://github.com/apache/hive/pull/6264
What changes were proposed in this pull request?
Adds a new StatisticsManagementTask.java that automatically deletes old
column statistics.
Why are the changes needed?
To keep old or stale stats from accumulating in the metastore.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Manual tests and unit tests.
For reviewers: What this PR does
This PR introduces a new “Statistics Management Task” in the Hive metastore
which periodically auto-deletes stale column statistics, plus configuration
knobs.
In MetastoreConf.java
Three new configuration variables are added:
STATISTICS_MANAGEMENT_TASK_FREQUENCY
Meaning: Controls how often the StatisticsManagementTask runs. The task only
considers tables that have statistics.auto.deletion=true in their table
properties.
STATISTICS_RETENTION_PERIOD
Meaning: The retention period for stats. If a table/partition’s stats are
older than this, they become candidates for auto deletion.
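STATISTICS_AUTO_DELETION
Meaning: Enables or disables automatic deletion of stale stats. When it is
disabled, the task logs and returns without deleting anything.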
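As a rough illustration of the per-table opt-in mentioned above, the check could look like the sketch below. The class and method names are hypothetical and are not taken from the PR; only the property name statistics.auto.deletion comes from this description.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not the PR's actual code.
final class AutoDeletionOptIn {
    // Table property name as given in the PR description.
    static final String AUTO_DELETION_PROP = "statistics.auto.deletion";

    private AutoDeletionOptIn() {}

    /**
     * Returns true only when the table parameters contain
     * statistics.auto.deletion=true (case-insensitive).
     * A missing map or missing property counts as "not opted in".
     */
    static boolean isOptedIn(Map<String, String> tableParameters) {
        if (tableParameters == null) {
            return false;
        }
        // Boolean.parseBoolean returns false for null and for any non-"true" value.
        return Boolean.parseBoolean(tableParameters.get(AUTO_DELETION_PROP));
    }
}
```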
In StatisticsManagementTask.java
Defines a new StatisticsManagementTask implementing MetastoreTaskThread. Its
purpose is to:
Fetch STATISTICS_RETENTION_PERIOD and STATISTICS_AUTO_DELETION from conf. If
retention <= 0 or auto deletion is disabled, log and return.
Compute lastAnalyzedThreshold = (now - retentionMillis) / 1000 (in seconds).
Use HMSHandler.getMSForConf(conf) to get RawStore and a PersistenceManager,
then query MTableColumnStatistics rows where lastAnalyzed < threshold.
In short, this class implements a background cleanup task that scans
MTableColumnStatistics for stale entries and deletes them via the metastore
client.
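For reviewers who want to sanity-check the guard and the threshold arithmetic described above, here is a minimal standalone sketch. It assumes lastAnalyzed is stored in epoch seconds, as the division by 1000 suggests. The class and method names are hypothetical; the real task reads these values from the metastore conf and queries MTableColumnStatistics, which this sketch does not attempt to reproduce.

```java
import java.util.OptionalLong;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not the PR's actual code.
final class StaleStatsThreshold {
    private StaleStatsThreshold() {}

    /**
     * Mirrors the described guard: if retention <= 0 or auto deletion is
     * disabled, there is no threshold and the task should skip its run.
     * Otherwise the cutoff is (now - retentionMillis) / 1000, in seconds.
     */
    static OptionalLong thresholdSeconds(long nowMillis, long retentionMillis,
                                         boolean autoDeletionEnabled) {
        if (retentionMillis <= 0 || !autoDeletionEnabled) {
            return OptionalLong.empty();
        }
        return OptionalLong.of((nowMillis - retentionMillis) / 1000L);
    }

    /** A stats row is stale when lastAnalyzed is strictly below the cutoff. */
    static boolean isStale(long lastAnalyzedSeconds, long thresholdSeconds) {
        return lastAnalyzedSeconds < thresholdSeconds;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // 365 days is an assumed retention value, chosen only for this example.
        OptionalLong cutoff = thresholdSeconds(now, TimeUnit.DAYS.toMillis(365), true);
        long analyzed400DaysAgo = (now - TimeUnit.DAYS.toMillis(400)) / 1000L;
        System.out.println("stale: " + isStale(analyzed400DaysAgo, cutoff.getAsLong()));
    }
}
```

Returning an empty OptionalLong for the "skip" case avoids overloading a numeric sentinel, since a computed threshold could in principle be any long value.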
In BenchmarkTool.java
BenchmarkTool can now benchmark the new statistics management task for
different numbers of tables.
In HMSBenchmarks.java Test
Constructs a dedicated database name and table prefix based on tableCount
and BenchData.
Gets an HMSClient and instantiates a StatisticsManagementTask.
Configures the client Hadoop conf:
hive.metastore.uris = metastore URI
metastore.statistics.management.database.pattern = dbName (so the task
focuses on this DB)
Sets the task’s conf and creates the database and tableCount tables.
Simulates old stats:
For each partition, sets lastAnalyzed to now - 400 days in the partition
parameters and alters the partition.
Post-run assertion:
Re-scans all partitions; if any partition parameters still contain
lastAnalyzed, it throws an AssertionError("Partition stats not deleted for
table: " + tableName).
In other words, this is an end-to-end microbenchmark for the new
StatisticsManagementTask that both measures performance and verifies that “old”
partition stats are actually cleaned up.
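The simulate-then-assert flow of the benchmark can be sketched without a running metastore by modelling each partition's parameters as a plain map. Everything here is a stand-in: the class, the cleanup method, and the assumption that lastAnalyzed is written in epoch seconds are illustrative, and only the lastAnalyzed key, the 400-day offset, and the assertion message come from the description above.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical in-memory model of the benchmark flow; not the PR's actual code.
final class OldStatsSimulation {
    static final String LAST_ANALYZED = "lastAnalyzed";

    private OldStatsSimulation() {}

    /** Simulates old stats: stamps every partition with now - 400 days (assumed epoch seconds). */
    static void backdate(List<Map<String, String>> partitions, long nowMillis) {
        long oldSeconds = (nowMillis - TimeUnit.DAYS.toMillis(400)) / 1000L;
        for (Map<String, String> params : partitions) {
            params.put(LAST_ANALYZED, Long.toString(oldSeconds));
        }
    }

    /** Stand-in for the task run: drops lastAnalyzed entries older than the threshold. */
    static void cleanup(List<Map<String, String>> partitions, long thresholdSeconds) {
        for (Map<String, String> params : partitions) {
            String value = params.get(LAST_ANALYZED);
            if (value != null && Long.parseLong(value) < thresholdSeconds) {
                params.remove(LAST_ANALYZED);
            }
        }
    }

    /** Post-run assertion as described: no partition may still carry lastAnalyzed. */
    static void assertCleaned(String tableName, List<Map<String, String>> partitions) {
        for (Map<String, String> params : partitions) {
            if (params.containsKey(LAST_ANALYZED)) {
                throw new AssertionError("Partition stats not deleted for table: " + tableName);
            }
        }
    }
}
```

With any retention period shorter than 400 days, backdate followed by cleanup leaves no lastAnalyzed entries, so assertCleaned passes. Skipping the cleanup step makes it throw, which is the failure the benchmark is designed to surface.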