DanielZhu58 opened a new pull request, #6264:
URL: https://github.com/apache/hive/pull/6264

   What changes were proposed in this pull request?
   Adds a new StatisticsManagementTask.java that automatically deletes old 
statistics.
   
   Why are the changes needed?
   To reduce the buildup of old or stale statistics in the metastore.
   
   Does this PR introduce any user-facing change?
   No.
   
   How was this patch tested?
   Manual tests and unit tests.
   
   For reviewers: What this PR does
   
   This PR introduces a new “Statistics Management Task” in the Hive metastore 
which periodically auto-deletes stale column statistics, plus configuration 
knobs.
   
   In MetastoreConf.java
   Three new configuration variables are added:
   STATISTICS_MANAGEMENT_TASK_FREQUENCY
   Meaning: Controls how often the StatisticsManagementTask runs. The task 
applies to tables that have statistics.auto.deletion=true in their table 
properties.
   STATISTICS_RETENTION_PERIOD
   Meaning: The retention period for stats. If a table/partition’s stats are 
older than this, they become candidates for auto deletion.
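   STATISTICS_AUTO_DELETION
   Meaning: Enables or disables automatic deletion of stale stats. When it is 
disabled, the task logs and returns.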
   
   In StatisticsManagementTask.java
   Defines a new StatisticsManagementTask implementing MetastoreTaskThread. Its 
purpose is to:
   Fetch STATISTICS_RETENTION_PERIOD and STATISTICS_AUTO_DELETION from conf. If 
retention <= 0 or auto deletion is disabled, log and return.
   Compute lastAnalyzedThreshold = (now - retentionMillis) / 1000 (in seconds).
   Use HMSHandler.getMSForConf(conf) to get RawStore and a PersistenceManager, 
then query MTableColumnStatistics rows where lastAnalyzed < threshold.
   In short, this class implements a background cleanup task that scans 
MTableColumnStatistics for stale entries and deletes them via the metastore 
client.
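
   The guard and threshold steps above can be sketched in plain Java. This is 
an illustration only, not the PR's code: the class and method names are 
hypothetical, the 365-day retention is an assumed value, and an in-memory map 
stands in for the MTableColumnStatistics query.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the task's cutoff logic; not the PR's actual class.
final class StaleStatsSketch {

    // Mirrors the description: (now - retentionMillis) / 1000, in epoch seconds.
    static long lastAnalyzedThreshold(long nowMillis, long retentionMillis) {
        return (nowMillis - retentionMillis) / 1000;
    }

    // The task is described as returning early when retention <= 0
    // or auto deletion is disabled.
    static boolean shouldRun(long retentionMillis, boolean autoDeletionEnabled) {
        return retentionMillis > 0 && autoDeletionEnabled;
    }

    // Selects entries whose lastAnalyzed (epoch seconds) is strictly below the
    // threshold. The map replaces the real metastore query for illustration.
    static List<String> staleEntries(Map<String, Long> lastAnalyzedByTable, long threshold) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastAnalyzedByTable.entrySet()) {
            if (e.getValue() < threshold) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long retention = Duration.ofDays(365).toMillis(); // assumed retention period
        long threshold = lastAnalyzedThreshold(now, retention);

        Map<String, Long> stats = Map.of(
            "fresh_table", now / 1000,
            "old_table", (now - Duration.ofDays(400).toMillis()) / 1000);

        if (shouldRun(retention, true)) {
            System.out.println("Stale: " + staleEntries(stats, threshold));
        }
    }
}
```

   Only entries analyzed before the cutoff are selected, so a table analyzed 
400 days ago qualifies under a 365-day retention while a freshly analyzed one 
does not.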
   
   In BenchmarkTool.java
   BenchmarkTool can now benchmark the new statistics management task for 
different numbers of tables.
   
   In HMSBenchmarks.java Test
   Constructs a dedicated database name and table prefix based on tableCount 
and BenchData.
   Gets an HMSClient and instantiates a StatisticsManagementTask.
   Configures the client Hadoop conf:
   hive.metastore.uris = metastore URI
   metastore.statistics.management.database.pattern = dbName (so the task 
focuses on this DB)
   Sets the task’s conf, then creates the database and tableCount tables.
   Simulates old stats:
   For each partition, sets lastAnalyzed to now - 400 days in the partition 
parameters and alters the partition.
   Post-run assertion:
   Re-scans all partitions; if any partition parameters still contain 
lastAnalyzed, it throws an AssertionError("Partition stats not deleted for 
table: " + tableName).
   In other words, this is an end-to-end microbenchmark for the new 
StatisticsManagementTask that both measures performance and verifies that “old” 
partition stats are actually cleaned up.
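
   The "simulate old stats" and post-run assertion steps can be sketched 
without a metastore. This is a hedged illustration: the class and helper names 
are hypothetical, plain maps stand in for partition parameters, and storing 
lastAnalyzed as epoch seconds is an assumption. Only the lastAnalyzed key and 
the error message come from the description above.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the benchmark's setup and verification steps.
final class PartitionStatsCheckSketch {

    static final String LAST_ANALYZED = "lastAnalyzed";

    // Stamps partition parameters with a lastAnalyzed value 400 days in the past.
    // Epoch seconds is an assumed unit; the description does not state it.
    static void markAsOld(Map<String, String> partitionParams, long nowMillis) {
        long oldSeconds = (nowMillis - Duration.ofDays(400).toMillis()) / 1000;
        partitionParams.put(LAST_ANALYZED, Long.toString(oldSeconds));
    }

    // Fails if any partition still carries lastAnalyzed after the task has run.
    static void assertStatsDeleted(String tableName, List<Map<String, String>> partitions) {
        for (Map<String, String> params : partitions) {
            if (params.containsKey(LAST_ANALYZED)) {
                throw new AssertionError("Partition stats not deleted for table: " + tableName);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        markAsOld(params, System.currentTimeMillis());

        // Stand-in for the cleanup task: drop the stale marker.
        params.remove(LAST_ANALYZED);

        assertStatsDeleted("bench_table_0", List.of(params));
        System.out.println("All partition stats cleaned up");
    }
}
```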
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

