[ 
https://issues.apache.org/jira/browse/HIVE-29116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vikram Ahuja updated HIVE-29116:
--------------------------------
    Description: 
The Hive property {{hive.exec.default.partition.name}} is currently a 
*session-level configuration* that determines the directory name used when 
partition column values are {{NULL}} or empty strings. While useful for 
controlling default behavior, this setting introduces *serious inconsistencies 
and operational challenges* in multi-user or shared environments.

 
h3. *Problems Caused by Session-Scoped Default Partition Names*
h4. 1. *Inconsistent Partition Layouts Across Sessions*

Different users or jobs may configure different values (e.g., 
{{__HIVE_DEFAULT_PARTITION__}}, {{NA}}, {{UNKNOWN}}), resulting in 
*multiple folders for NULL partitions* under the same table. This leads to:
 * Fragmentation of data

 * Unreliable query results

 * Duplicate rows or missed data
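
The fragmentation described above can be illustrated with a small Python simulation. This is only a sketch, not Hive's actual implementation: the helper names ({{partition_dir}}, {{simulate_writes}}) and the {{country}} column are hypothetical, and path escaping of special characters is ignored.

```python
from collections import defaultdict

HIVE_DEFAULT = "__HIVE_DEFAULT_PARTITION__"


def partition_dir(column, value, default_name=HIVE_DEFAULT):
    """Return the directory name for one partition column value.

    NULL (None) and empty strings fall back to the session's default name.
    Simplified: no escaping of special characters.
    """
    if value is None or value == "":
        return f"{column}={default_name}"
    return f"{column}={value}"


def simulate_writes(writes):
    """Count rows per directory; each write is (session default name, value)."""
    layout = defaultdict(int)
    for default_name, value in writes:
        layout[partition_dir("country", value, default_name)] += 1
    return dict(layout)


# Three sessions with different settings write NULL or empty values
# into the same (hypothetical) table, plus one ordinary row.
layout = simulate_writes([
    (HIVE_DEFAULT, None),
    ("NA", None),
    ("UNKNOWN", ""),
    (HIVE_DEFAULT, "US"),
])
null_dirs = sorted(d for d in layout if d != "country=US")
```

In this sketch the three NULL or empty rows end up in three different directories under the same table, one per session setting.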

h4. 2. *Interoperability Failures with External Engines*

Engines like *Apache Spark*, *Trino*, and *Presto* are unaware of the 
session-scoped Hive config, which results in:
 * Missing or partially loaded data when querying Hive tables

 * Incorrect partition pruning or data skipping

 * Silent logical errors

h4. 3. *Partition Management & Repair Failures*

Commands like {{MSCK REPAIR TABLE}}, tools that list partitions, and 
automatic partition management may treat differently named default partitions 
as distinct, making repair, cleanup, and compaction logic brittle.
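
A rough Python sketch of why this happens, assuming a reader that knows only one default partition name. The function {{interpret_partition}} and the directory names are hypothetical and do not reflect the real logic of Hive, MSCK, or any other engine.

```python
def interpret_partition(dir_name, default_name="__HIVE_DEFAULT_PARTITION__"):
    """Split 'col=value' and map the reader's default name back to NULL (None)."""
    column, _, raw = dir_name.partition("=")
    return column, (None if raw == default_name else raw)


# Directories left behind by sessions that used different default names.
dirs = [
    "country=US",
    "country=__HIVE_DEFAULT_PARTITION__",
    "country=NA",
    "country=UNKNOWN",
]

discovered = [interpret_partition(d) for d in dirs]

# Only the directory matching the reader's own setting is seen as NULL.
null_partitions = [d for d, (_, v) in zip(dirs, discovered) if v is None]
# Everything else is treated as an ordinary partition value.
literal_values = sorted(v for _, v in discovered if v is not None)
```

Here only one directory is recognized as the NULL partition, while {{NA}} and {{UNKNOWN}} are read as ordinary string values, so any repair or cleanup step built on this listing would handle them as separate partitions.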
h4. 4. *Difficulties During Migration to Iceberg*

Modern table formats like *Iceberg* assume consistent and valid partition 
paths. When migrating, multiple default partition folders complicate the 
process and increase the risk of data loss or inconsistency.
h4. 5. *Storage Bloat & Retention Policy Issues*

Data with NULL partitions can accumulate across multiple folders and may be 
missed by retention or cleanup tools. This causes:
 * Inefficient storage

 * Missed deletes

 * Garbage accumulation

h4. 6. *Risk of Human Error and Debugging Overhead*

Since this is a session-level config, developers and analysts may forget to set 
it consistently — especially during ad-hoc queries or notebook exploration. 
This leads to:
 * Hard-to-reproduce bugs

 * Test environment differences

 * Broken CI/CD data tests

 

 
h3. *Proposed Improvements*

We propose the following:
 # Create a DDL statement for setting the Hive default partition name at the 
table level.
 # Make {{hive.exec.default.partition.name}} immutable at runtime, so that it 
can only be set at the cluster level.

These two changes ensure a single, consistent default partition folder per 
table. Because the value is stored at the table level, other engines can also 
use it.
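
One possible reading of this proposal, sketched in Python: every writer resolves the default name from table metadata first and falls back to the cluster-wide value. The property key {{default.partition.name}} and the helper names are hypothetical; the issue does not specify the DDL syntax or where the value would be stored.

```python
CLUSTER_DEFAULT = "__HIVE_DEFAULT_PARTITION__"

# Hypothetical table property key; the issue does not define the actual
# DDL or the metadata field that would hold this value.
TABLE_PROP_KEY = "default.partition.name"


def resolve_default_partition_name(table_props, cluster_default=CLUSTER_DEFAULT):
    """Prefer the table-level value; fall back to the cluster-wide one."""
    name = table_props.get(TABLE_PROP_KEY)
    return name if name else cluster_default


def partition_dir(column, value, table_props):
    """Directory name for a partition value, using the table's own default."""
    if value is None or value == "":
        return f"{column}={resolve_default_partition_name(table_props)}"
    return f"{column}={value}"


# Every writer, regardless of session, reads the same table metadata.
orders_props = {TABLE_PROP_KEY: "UNKNOWN"}
dirs = {partition_dir("country", v, orders_props) for v in (None, "", None)}

# A table without the property falls back to the cluster-level default.
fallback_dir = partition_dir("country", None, {})
```

Since no session-level input takes part in the lookup, all NULL and empty values for a given table map to one directory in this sketch.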

  was:
# Create a DDL for setting hive default partition name at the table level.
 # Make hive.exec.deault.partition.name immutable at runtime


> Create a DDL for setting hive default partition name at the table level
> -----------------------------------------------------------------------
>
>                 Key: HIVE-29116
>                 URL: https://issues.apache.org/jira/browse/HIVE-29116
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Vikram Ahuja
>            Assignee: Vikram Ahuja
>            Priority: Major
>              Labels: pull-request-available



