ShivamS136 opened a new issue, #15163:
URL: https://github.com/apache/pinot/issues/15163
## Issue Description
There appears to be a significant difference in deduplication behavior
between Pinot v1.2.0 and v1.3.0. The behavior change affects how records are
deduplicated based on the `dedupTimeColumn` and `metadataTTL` settings.
## Environment
- **Affected Pinot Versions**:
  - v1.3.0 (new behavior)
  - v1.2.0 (previous behavior)
## Deduplication Behavior Differences
### In v1.3.0:
- Records are deduplicated only once at least one ingested record has a
`dedupTimeColumn` value at most `metadataTTL` older than the current time
- Once a record within the TTL is inserted, deduplication works
- Records outside the TTL are inserted successfully even if the data is
identical (potential duplicates)
- Once one record within the TTL is encountered, the primary key is created
and all future records with the same primary key are deduplicated
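
The v1.3.0 behavior described above can be sketched as a toy model. This is only an illustration of what the report describes, not Pinot's actual implementation; the class name `ReportedV130Dedup`, the `ingest` method, and the explicit `now_ms` argument are hypothetical names introduced for this sketch, and the TTL value is taken from the table config in this issue.

```python
from dataclasses import dataclass, field

METADATA_TTL_MS = 600_000  # metadataTTL from the table config in this issue


@dataclass
class ReportedV130Dedup:
    """Toy model of the v1.3.0 behavior as described in this issue (not Pinot code)."""

    ttl_ms: int = METADATA_TTL_MS
    seen_keys: set = field(default_factory=set)

    def ingest(self, primary_key, dedup_time_ms, now_ms):
        """Return True if the record is inserted, False if it is dropped as a duplicate."""
        if primary_key in self.seen_keys:
            return False  # key already tracked: record is deduplicated
        if now_ms - dedup_time_ms <= self.ttl_ms:
            self.seen_keys.add(primary_key)  # within TTL: key is tracked from now on
        return True  # inserted (outside-TTL records never register the key)


now = 1_740_000_000_000           # arbitrary "current time" in epoch millis
old = now - 3_600_000             # one hour old, outside the 10-minute TTL
pk = (1, "p1", 1)                 # leaderboard_id, participant_id, attempt_number

dedup_v130 = ReportedV130Dedup()
results_v130 = [
    dedup_v130.ingest(pk, old, now),  # outside TTL: inserted, key not tracked
    dedup_v130.ingest(pk, old, now),  # outside TTL again: inserted again (duplicate)
    dedup_v130.ingest(pk, now, now),  # within TTL: inserted, key now tracked
    dedup_v130.ingest(pk, old, now),  # key tracked: deduplicated
]
print(results_v130)
```

In this model the first two identical records both land in the table, which matches the "potential duplicates" case reported above.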
### In v1.2.0:
- The `dedupTimeColumn` doesn't seem to affect deduplication
- Any record inserted into Pinot has its primary key created, irrespective
of its time column value
- Future records with the same primary key value get deduped consistently
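
For contrast, the v1.2.0 behavior described above amounts to a plain primary-key set. The sketch below is again a toy model of the reported behavior under that assumption, not Pinot code; `ReportedV120Dedup` and `ingest` are hypothetical names.

```python
class ReportedV120Dedup:
    """Toy model of the v1.2.0 behavior as described in this issue (not Pinot code)."""

    def __init__(self):
        self.seen_keys = set()

    def ingest(self, primary_key, dedup_time_ms):
        """Return True if the record is inserted, False if it is dropped as a duplicate."""
        # dedup_time_ms is accepted but deliberately ignored: per the report,
        # the time column does not affect deduplication in v1.2.0.
        if primary_key in self.seen_keys:
            return False
        self.seen_keys.add(primary_key)
        return True


dedup_v120 = ReportedV120Dedup()
pk_v120 = (1, "p1", 1)  # leaderboard_id, participant_id, attempt_number
# Same key three times with very different timestamps (epoch millis).
results_v120 = [dedup_v120.ingest(pk_v120, t) for t in (0, 1_740_000_000_000, 0)]
print(results_v120)
```

Only the first record is inserted here, whatever its timestamp.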
## Expected Behavior
Deduplication should work consistently across versions and should properly
deduplicate records based on the primary key, regardless of the time column
values.
## Table Configuration
<details>
<summary>Table Schema</summary>
```json
{
  "schemaName": "leaderboard_entries",
  "dimensionFieldSpecs": [
    {
      "name": "leaderboard_id",
      "dataType": "LONG"
    },
    {
      "name": "participant_id",
      "dataType": "STRING"
    },
    {
      "name": "attempt_number",
      "dataType": "INT",
      "defaultNullValue": 1
    },
    {
      "name": "entry_meta",
      "dataType": "JSON",
      "defaultNullValue": "{}"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "score",
      "dataType": "INT",
      "defaultNullValue": 0
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "insertion_time",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    },
    {
      "name": "attempt_time",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ],
  "primaryKeyColumns": ["leaderboard_id", "participant_id", "attempt_number"]
}
```
</details>
<details>
<summary>Table Config</summary>
```json
{
  "tableName": "leaderboard_entries",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "insertion_time",
    "replication": "2",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "90",
    "timeType": "MILLISECONDS"
  },
  "query": {
    "timeoutMs": "5000"
  },
  "tenants": {},
  "tableIndexConfig": {
    "sortedColumn": ["score"]
  },
  "fieldConfigList": [
    {
      "name": "leaderboard_id",
      "indexes": {
        "inverted": {}
      }
    },
    {
      "name": "participant_id",
      "indexes": {
        "bloom": {}
      }
    }
  ],
  "ingestionConfig": {
    "streamIngestionConfig": {
      "streamConfigMaps": [
        {
          "streamType": "kafka",
          "stream.kafka.consumer.type": "lowlevel",
          "stream.kafka.topic.name": "leaderboard-entry",
          "stream.kafka.broker.list": "kafka:9092",
          "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
          "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
          "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
          "stream.kafka.consumer.prop.format": "JSON",
          "realtime.segment.flush.threshold.time": "4h",
          "realtime.segment.flush.threshold.rows": "0",
          "realtime.segment.flush.threshold.segment.rows": "0",
          "realtime.segment.flush.threshold.segment.size": "20M"
        }
      ]
    }
  },
  "metadata": {
    "customConfigs": {}
  },
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  },
  "dedupConfig": {
    "dedupEnabled": true,
    "hashFunction": "NONE",
    "dedupTimeColumn": "insertion_time",
    "metadataTTL": 600000,
    "enablePreload": true
  }
}
```
</details>
## Observations
When the table is added on v1.2.0, the response includes the following
warning, suggesting that the `dedupTimeColumn` and `metadataTTL` properties
might not be recognized or used in this version:
```json
{
  "unrecognizedProperties": {
    "/dedupConfig/dedupTimeColumn": "insertion_time",
    "/dedupConfig/metadataTTL": 600000
  },
  "status": "Table leaderboard_entries_REALTIME successfully added"
}
```
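
A check for this warning could be scripted when validating a table config against a given version. The snippet below is a minimal sketch that assumes the response has the shape shown above; `unrecognized_dedup_properties` is a hypothetical helper, not part of any Pinot client library.

```python
import json


def unrecognized_dedup_properties(response_text):
    """Return the dedupConfig entries that the response reports as unrecognized."""
    response = json.loads(response_text)
    unrecognized = response.get("unrecognizedProperties", {})
    return {
        path: value
        for path, value in unrecognized.items()
        if path.startswith("/dedupConfig/")
    }


# Sample response copied from the warning shown above.
sample_response = """
{
  "unrecognizedProperties": {
    "/dedupConfig/dedupTimeColumn": "insertion_time",
    "/dedupConfig/metadataTTL": 600000
  },
  "status": "Table leaderboard_entries_REALTIME successfully added"
}
"""

ignored = unrecognized_dedup_properties(sample_response)
for path, value in ignored.items():
    print(f"not recognized by this version: {path} = {value!r}")
```

A non-empty result would indicate that the dedup settings in the config are likely not taking effect on that version, even though the table was added successfully.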
## Impact
This behavior change can lead to:
1. Unexpected duplicates in v1.3.0 when records are outside the TTL window
2. Inconsistent deduplication behavior when migrating from v1.2.0 to v1.3.0
3. Potential data integrity issues if applications rely on the previous
deduplication behavior
## Proposed Solution
Either:
1. Restore the v1.2.0 behavior where deduplication works consistently
regardless of time column values, or
2. Clearly document this behavior change and provide configuration options
to maintain backward compatibility
## Additional Information
Related Slack thread with more info:
https://apache-pinot.slack.com/archives/C011C9JHN7R/p1740757158048619