u-ranjith-kumar opened a new issue, #17535:
URL: https://github.com/apache/pinot/issues/17535
We are using **OFFLINE dimension tables** in Apache Pinot and are facing
**duplicate rows with the same primary key** during batch ingestion.
Currently:
* `APPEND` ingestion is not supported for dimension tables
* `REFRESH` ingestion keeps re-reading the same input files
* This results in **duplicate primary keys** in the dimension table
Pinot recently added support to **detect and error on duplicate primary
keys** using:
```json
"dimensionTableConfig": {
"errorOnDuplicatePrimaryKey": true
}
```
(PR: #12290)
While this helps catch the issue, it does not solve the core use case, where
we want to overwrite existing rows by primary key (UPSERT semantics) instead of
failing ingestion.
Current Behavior
* OFFLINE dimension tables do not support UPSERT
* Duplicate primary keys are either:
  * silently allowed (default), or
  * rejected using `errorOnDuplicatePrimaryKey=true`
* There is no way to overwrite an existing row for a primary key during
  OFFLINE ingestion
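
To make the two modes concrete, here is a minimal toy model of the behavior described above. It is not Pinot's implementation: `load_rows`, `DuplicatePrimaryKeyError`, and the `store_id`/`area` columns are hypothetical names used only for illustration.

```python
# Toy model of the two duplicate-handling modes described above.
# This is NOT Pinot code; every name here is hypothetical.


class DuplicatePrimaryKeyError(Exception):
    """Raised when strict validation sees the same primary key twice."""


def load_rows(rows, primary_key, error_on_duplicate=False):
    """Return the rows that end up stored after loading `rows`."""
    seen = set()
    loaded = []
    for row in rows:
        key = row[primary_key]
        if key in seen and error_on_duplicate:
            # Strict mode: fail the whole load on the first duplicate key.
            raise DuplicatePrimaryKeyError(f"duplicate primary key: {key!r}")
        seen.add(key)
        # Default mode: every row is kept, duplicates included.
        loaded.append(row)
    return loaded


rows = [
    {"store_id": 1, "area": "north"},
    {"store_id": 1, "area": "south"},  # same key, newer value
]

# Default mode keeps both rows for store_id 1.
default_result = load_rows(rows, "store_id")
```

Neither mode leaves exactly one row per key with the newer value, which is the gap this issue is about.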
---
Expected Behavior
Support UPSERT semantics for OFFLINE dimension tables, similar to REALTIME
upsert tables:
* If a record with an existing primary key is ingested:
  * overwrite the existing row
  * do not create duplicate records
* Allow deterministic, idempotent batch ingestion
* Enable safe reprocessing and reruns
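
The expected semantics can be sketched as a merge keyed by primary key. This is only an illustration of the requested behavior under the assumption that "latest" means "last row in the batch"; `upsert` and the sample columns are hypothetical, not an existing Pinot API.

```python
# Sketch of the requested UPSERT semantics (hypothetical helper, not Pinot code).


def upsert(table, batch, primary_key):
    """Return a new table state where each batch row overwrites its key."""
    merged = dict(table)  # copy so the previous state is left untouched
    for row in batch:
        merged[row[primary_key]] = row  # later rows replace earlier ones
    return merged


batch = [
    {"store_id": 1, "area": "north"},
    {"store_id": 2, "area": "east"},
    {"store_id": 1, "area": "south"},  # overwrites the first row for key 1
]

once = upsert({}, batch, "store_id")
twice = upsert(once, batch, "store_id")  # simulate rerunning the same job
```

Because rerunning the same batch yields the same table state, reprocessing is safe: `once` and `twice` hold identical rows, one per primary key.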
---
Why this is needed
Offline dimension tables are commonly used for:
* Slowly changing dimensions (store, area, category, mappings)
* Periodic full refreshes or partial backfills
* Reference data that naturally evolves over time
Without upsert support:
* Pipelines are fragile
* Reruns cause duplicates
* Users are forced to move dimension data to REALTIME ingestion, which is
not always desirable
---
Workarounds today
1. Enable strict validation:
```json
"dimensionTableConfig": {
"errorOnDuplicatePrimaryKey": true
}
```
→ Prevents bad data, but ingestion fails whenever a duplicate key appears
2. Move dimension data to a REALTIME upsert table
→ Works, but adds operational complexity and is not ideal for
batch-managed dimensions
---
Proposal
Add support for OFFLINE UPSERT dimension tables, where:
* Primary key uniqueness is enforced
* Latest record overwrites the previous one
* Behavior is deterministic and rerun-safe
This would align OFFLINE dimension tables with REALTIME upsert capabilities
and significantly simplify batch ingestion workflows.
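
One open design question is how "latest" is decided. A possible approach, sketched below, is to compare an explicit version column so the result does not depend on the order in which input files are read. The `updated_at` column and the `upsert_latest` helper are assumptions made for this example, not part of any existing Pinot configuration.

```python
# Sketch: pick the winning row per key by an explicit version column.
# `updated_at` and `upsert_latest` are hypothetical names for illustration.


def upsert_latest(rows, primary_key, version_column):
    """Keep, for each primary key, the row with the highest version value."""
    latest = {}
    for row in rows:
        key = row[primary_key]
        current = latest.get(key)
        # Strict '>' means ties keep the first row seen, so versions should
        # be unique per key for a fully order-independent result.
        if current is None or row[version_column] > current[version_column]:
            latest[key] = row
    return latest


rows = [
    {"store_id": 1, "area": "north", "updated_at": 100},
    {"store_id": 1, "area": "south", "updated_at": 200},
    {"store_id": 2, "area": "east", "updated_at": 150},
]

forward = upsert_latest(rows, "store_id", "updated_at")
backward = upsert_latest(list(reversed(rows)), "store_id", "updated_at")
```

With unique versions per key, reading the rows in either order produces the same table, which is what makes reruns and out-of-order backfills deterministic.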
---
Related Issues / PRs
* Duplicate primary key handling for dimension tables: #12284
* Disallow duplicate primary keys (error-only):