[
https://issues.apache.org/jira/browse/HIVE-29116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vikram Ahuja updated HIVE-29116:
--------------------------------
Description:
The Hive property {{hive.exec.default.partition.name}} is currently a
*session-level configuration* that determines the directory name used when
partition column values are {{NULL}} or empty strings. While useful for
controlling default behavior, this setting introduces *serious inconsistencies
and operational challenges* in multi-user or shared environments.
h3. *Problems Caused by Session-Scoped Default Partition Names*
h4. 1. *Inconsistent Partition Layouts Across Sessions*
Different users or jobs may configure different values (e.g.,
{{__HIVE_DEFAULT_PARTITION__}}, {{NA}}, {{UNKNOWN}}), resulting in
*multiple folders for NULL partitions* under the same table. This leads to:
* Fragmentation of data
* Unreliable query results
* Duplicate rows or missed data
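
The fragmentation described above can be illustrated with a minimal sketch. It is not Hive's implementation: the {{partition_dir}} helper, the {{country}} column, and the three session values are hypothetical, and Hive's path escaping of special characters is left out. The only assumption taken from Hive itself is that {{__HIVE_DEFAULT_PARTITION__}} is the shipped default for the property.

```python
HIVE_DEFAULT = "__HIVE_DEFAULT_PARTITION__"  # shipped default of hive.exec.default.partition.name


def partition_dir(column, value, default_name=HIVE_DEFAULT):
    """Return the directory name for one partition value (hypothetical helper).

    NULL (None) and empty strings map to the session's default partition name.
    Escaping of special characters in paths is deliberately omitted.
    """
    if value is None or value == "":
        return f"{column}={default_name}"
    return f"{column}={value}"


# Three hypothetical sessions, each with its own setting, write a NULL
# `country` value into the same table.
session_defaults = [HIVE_DEFAULT, "NA", "UNKNOWN"]
null_dirs = sorted({partition_dir("country", None, name) for name in session_defaults})

# One logical partition (country IS NULL) now lives in three directories.
print(null_dirs)
```

Under these assumptions a single logical partition ends up spread over as many directories as there are distinct session settings.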
h4. 2. *Interoperability Failures with External Engines*
Engines like *Apache Spark*, *Trino*, and *Presto* are unaware of the
session-scoped Hive config, which results in:
* Missing or partially loaded data when querying Hive tables
* Incorrect partition pruning or data skipping
* Silent logical errors
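
As a rough sketch of the interoperability problem, consider a reader that recognizes exactly one default partition name. The {{classify}} function and the directory names below are hypothetical and do not reproduce the code of any particular engine; they only show how directories written under other session settings would be read as ordinary string values.

```python
HIVE_DEFAULT = "__HIVE_DEFAULT_PARTITION__"


def classify(dirs, column, recognized_default, names_used_by_writers):
    """Split partition directories into those a reader treats as NULL and
    those it would misread as literal values (hypothetical helper)."""
    nulls, misread = [], []
    for d in dirs:
        col, _, value = d.partition("=")
        if col != column:
            continue
        if value == recognized_default:
            nulls.append(d)  # reader maps this directory to NULL
        elif value in names_used_by_writers:
            misread.append(d)  # written as NULL, read back as a literal string
    return nulls, misread


table_dirs = [
    "country=US",
    "country=__HIVE_DEFAULT_PARTITION__",
    "country=NA",
    "country=UNKNOWN",
]
nulls, misread = classify(
    table_dirs, "country", HIVE_DEFAULT, {HIVE_DEFAULT, "NA", "UNKNOWN"}
)
print(nulls)
print(misread)
```

In this sketch a filter such as {{country IS NULL}} would only see the first list, while rows in the second list would surface as the strings {{NA}} and {{UNKNOWN}}, which can also collide with legitimate partition values.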
h4. 3. *Partition Management & Repair Failures*
Commands such as {{MSCK REPAIR TABLE}}, tools that list partitions, and
automatic partition management may treat differently named default partitions
as distinct, which makes repair, cleanup, and compaction logic brittle.
h4. 4. *Difficulties During Migration to Iceberg*
Modern table formats like *Iceberg* assume consistent and valid partition
paths. When migrating, multiple default partition folders complicate the
process and increase the risk of data loss or inconsistency.
h4. 5. *Storage Bloat & Retention Policy Issues*
Data with NULL partitions can accumulate across multiple folders and may be
missed by retention or cleanup tools. This causes:
* Inefficient storage
* Missed deletes
* Garbage accumulation
h4. 6. *Risk of Human Error and Debugging Overhead*
Since this is a session-level config, developers and analysts may forget to set
it consistently — especially during ad-hoc queries or notebook exploration.
This leads to:
* Hard-to-reproduce bugs
* Test environment differences
* Broken CI/CD data tests
h3. *Proposed Improvements*
We propose the following:
# Create a DDL for setting the Hive default partition name at the table level.
# Make {{hive.exec.default.partition.name}} immutable at runtime, so that it
can only be set at the cluster level.
Together, these two changes ensure a single, consistent default partition
folder per table. Because the value is stored at the table level, other
engines can also use it.
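
One possible resolution order under this proposal is sketched below. The table property key {{default.partition.name}} and the {{resolve_default_partition_name}} function are hypothetical, since the issue does not specify the DDL syntax or where the value is stored. The sketch assumes a table-level value takes precedence, the cluster-level setting is the fallback, and session overrides are not consulted at all.

```python
HIVE_DEFAULT = "__HIVE_DEFAULT_PARTITION__"
CLUSTER_KEY = "hive.exec.default.partition.name"
TABLE_PROPERTY_KEY = "default.partition.name"  # hypothetical key, not defined by the issue


def resolve_default_partition_name(table_properties, cluster_conf):
    """Pick the default partition name for a table (hypothetical helper).

    Order: table-level value, then cluster-level config, then Hive's shipped
    default. Session-level settings are intentionally ignored.
    """
    table_value = table_properties.get(TABLE_PROPERTY_KEY)
    if table_value:
        return table_value
    return cluster_conf.get(CLUSTER_KEY, HIVE_DEFAULT)


cluster_conf = {CLUSTER_KEY: "__HIVE_DEFAULT_PARTITION__"}
with_table_value = resolve_default_partition_name({TABLE_PROPERTY_KEY: "UNKNOWN"}, cluster_conf)
without_table_value = resolve_default_partition_name({}, cluster_conf)
print(with_table_value, without_table_value)
```

Because every writer and reader would resolve the name from the same table metadata, all of them would arrive at the same directory for NULL partition values.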
was:
# Create a DDL for setting hive default partition name at the table level.
# Make hive.exec.default.partition.name immutable at runtime
> Create a DDL for setting hive default partition name at the table level
> -----------------------------------------------------------------------
>
> Key: HIVE-29116
> URL: https://issues.apache.org/jira/browse/HIVE-29116
> Project: Hive
> Issue Type: Bug
> Reporter: Vikram Ahuja
> Assignee: Vikram Ahuja
> Priority: Major
> Labels: pull-request-available
--
This message was sent by Atlassian Jira
(v8.20.10#820010)