[
https://issues.apache.org/jira/browse/HIVE-28813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hongdan Zhu updated HIVE-28813:
-------------------------------
Description:
Add a console warning to the MSCK and ANALYZE commands for any table or
partition whose average file size (totalSize / numFiles) is less than
avgFileSize.
With this change, when users run *{{MSCK REPAIR}}* or {*}{{ANALYZE}}{*}, Hive
*prints a console warning* if it detects that the table or its partitions
contain {*}too many small files{*}. The warning includes basic stats (e.g.,
file count and average file size) to help users catch performance risks early.
Sample output from the MSCK command:
{code:java}
hive> MSCK REPAIR TABLE sales;
Partitions not in metastore: sales/dt=2025-01-01, sales/dt=2025-01-02
Repair: Added partition to metastore sales/dt=2025-01-01
Repair: Added partition to metastore sales/dt=2025-01-02
[MSCK] Small files detected.
[MSCK] Average file size is too small, small files exist.
Partition name: dt=2025-01-01. Small files detected: partition dt=2025-01-01 (avgBytes=2048, files=5000, totalBytes=10240000)
[MSCK] Average file size is too small, small files exist.
Partition name: dt=2025-01-02. Small files detected: partition dt=2025-01-02 (avgBytes=1024, files=8000, totalBytes=8192000)
OK
{code}
Sample output from the ANALYZE command:
{code:java}
hive> ANALYZE TABLE sales PARTITION(dt='2025-01-01') COMPUTE STATISTICS;
...
[ANALYZE] Small files detected: partition dt=2025-01-01 (avgBytes=2048, files=5000, totalBytes=10240000)
OK
{code}
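For illustration only, the check described above (totalSize / numFiles compared against a threshold) could be sketched roughly as follows. This is not the code from the actual patch: the class name {{SmallFileWarning}}, the method {{check}}, and the 16,000,000-byte threshold are all hypothetical, chosen only so that the sketch reproduces the ANALYZE sample line above.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper; NOT the actual HIVE-28813 implementation. */
final class SmallFileWarning {

    private SmallFileWarning() {}

    /**
     * Returns a warning message when the average file size
     * (totalBytes / numFiles) is below thresholdBytes, otherwise empty.
     */
    static Optional<String> check(String command, String partition,
                                  long totalBytes, long numFiles,
                                  long thresholdBytes) {
        if (numFiles <= 0) {
            // No files recorded: nothing to average, and avoids division by zero.
            return Optional.empty();
        }
        long avgBytes = totalBytes / numFiles; // integer division, rounds down
        if (avgBytes >= thresholdBytes) {
            return Optional.empty();
        }
        // Message layout mirrors the sample output in the description.
        return Optional.of(String.format(Locale.ROOT,
            "[%s] Small files detected: partition %s (avgBytes=%d, files=%d, totalBytes=%d)",
            command, partition, avgBytes, numFiles, totalBytes));
    }

    public static void main(String[] args) {
        long threshold = 16_000_000L; // assumed threshold, for illustration only
        check("ANALYZE", "dt=2025-01-01", 10_240_000L, 5_000L, threshold)
            .ifPresent(System.out::println);
    }
}
```

With the numbers from the sample (10240000 bytes across 5000 files), the average is 2048 bytes, which is below the assumed threshold, so the sketch emits a warning. A partition with zero files is skipped rather than reported.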
was:
Add warning when
MSCK/Analyze commands can show a warning in console for table/partition if
(totalSize / numFiles) is less than avgFileSize
> MSCK/Analyze commands can show a warning in console for Small files.
> --------------------------------------------------------------------
>
> Key: HIVE-28813
> URL: https://issues.apache.org/jira/browse/HIVE-28813
> Project: Hive
> Issue Type: Task
> Reporter: Hongdan Zhu
> Assignee: Hongdan Zhu
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.2.0
>