[ 
https://issues.apache.org/jira/browse/HIVE-28813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hongdan Zhu updated HIVE-28813:
-------------------------------
    Description: 
Add a console warning to the MSCK and ANALYZE commands: warn for a 
table/partition when (totalSize / numFiles) is less than avgFileSize.

The code change adds *small-file warnings* to Hive: when users run 
*{{MSCK REPAIR}}* or {*}{{ANALYZE}}{*}, Hive *prints a console warning* if 
it detects that the table or its partitions contain {*}too many small 
files{*}, along with basic stats (e.g., file count and average file size), 
to help users catch performance risks early.
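
The check described above can be sketched roughly as follows. This is only an illustration of the (totalSize / numFiles) < avgFileSize condition, not the actual patch: the class name {{SmallFileCheck}}, the method {{warning}}, and its parameters are hypothetical, and the message format is copied from the ANALYZE sample in this description.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper for illustration only; not the actual Hive implementation. */
final class SmallFileCheck {

    private SmallFileCheck() {}

    /**
     * Returns a warning line when the average file size (totalSize / numFiles)
     * is below avgFileSizeThreshold, otherwise an empty Optional.
     * All parameter names are assumptions made for this sketch.
     */
    static Optional<String> warning(String command, String partition,
                                    long totalSize, long numFiles,
                                    long avgFileSizeThreshold) {
        if (numFiles <= 0) {
            // No files: there is no average to compute, so do not warn.
            return Optional.empty();
        }
        long avgBytes = totalSize / numFiles; // integer division, in bytes
        if (avgBytes >= avgFileSizeThreshold) {
            return Optional.empty();
        }
        return Optional.of(String.format(Locale.ROOT,
            "[%s] Small files detected: partition %s (avgBytes=%d, files=%d, totalBytes=%d)",
            command, partition, avgBytes, numFiles, totalSize));
    }
}
```

With the sample numbers used in this description (totalBytes=10240000, files=5000), the average is 2048 bytes, so any threshold above that value (for example an assumed 16 MB) would trigger the warning.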

Sample:

In the MSCK command:
{code:java}
hive> MSCK REPAIR TABLE sales;
Partitions not in metastore: sales/dt=2025-01-01, sales/dt=2025-01-02
Repair: Added partition to metastore sales/dt=2025-01-01
Repair: Added partition to metastore sales/dt=2025-01-02
[MSCK] Small files detected.
[MSCK] Average file size is too small, small files exist.
 Partition name: dt=2025-01-01. Small files detected: partition dt=2025-01-01 
(avgBytes=2048, files=5000, totalBytes=10240000)
[MSCK] Average file size is too small, small files exist.
 Partition name: dt=2025-01-02. Small files detected: partition dt=2025-01-02 
(avgBytes=1024, files=8000, totalBytes=8192000)
OK
{code}
In the ANALYZE command:
{code:java}
hive> ANALYZE TABLE sales PARTITION(dt='2025-01-01') COMPUTE STATISTICS;
...
[ANALYZE] Small files detected: partition dt=2025-01-01 (avgBytes=2048, 
files=5000, totalBytes=10240000)
OK
{code}

  was:
Add warning when
MSCK/Analyze commands can show a warning in console for table/partition if 
(totalSize / numFiles) is less than avgFileSize


> MSCK/Analyze commands can show a warning in console for Small files.
> --------------------------------------------------------------------
>
>                 Key: HIVE-28813
>                 URL: https://issues.apache.org/jira/browse/HIVE-28813
>             Project: Hive
>          Issue Type: Task
>            Reporter: Hongdan Zhu
>            Assignee: Hongdan Zhu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.2.0
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
