avshenuk opened a new issue, #16359:
URL: https://github.com/apache/pinot/issues/16359
### What needs to be done:
Add upsert support to RealtimeToOfflineSegmentsTask to enable Pinot-managed
offline flows for upsert-enabled realtime tables.
### Why the feature is needed:
Currently, RealtimeToOfflineSegmentsTask doesn't support upsert tables,
creating a gap in Pinot's managed offline flows for hybrid upsert
architectures.
This prevents users from leveraging the efficient realtime-to-offline
conversion pipeline when using upsert tables.
This feature is particularly valuable as it enables a clean separation
between realtime tables with upsert support (handling mutable, evolving data)
and offline tables containing the "final" immutable state of segments.
This architecture allows users to benefit from upsert semantics during
real-time ingestion while maintaining optimized, deduplicated offline segments
for historical queries and analytics workloads.
### Proposed Implementation:
1. **Upsert Segment Validation**: Validate segments using validDocIds
metadata from servers with CRC checks and server state validation. Fail task
generation when metadata is missing since time windows cannot be reprocessed,
preventing silent data loss.
2. **Invalid Record Filtering**: Integrate CompactedPinotSegmentRecordReader
with the existing SegmentProcessorFramework to filter out invalid records
during segment processing, naturally combining with time windowing,
partitioning, merging, and sorting operations.
### Benefits:
- Fills the gap for missing upsert support in Pinot's managed offline flow
- Natural integration with existing segment processing pipeline (time
filtering, sorting, etc.)
- Maintains data consistency through proper validation and prevents silent
data loss
- Optimal timing for upsert compaction:
Performs filtering during the realtime-to-offline conversion when data is
already being processed, avoiding the need for separate compaction passes and
maximizing efficiency
- Operates independently of other upsert maintenance tasks with clear
separation of concerns:
- other upsert tasks (UpsertCompactionTask, UpsertCompactMergeTask)
optimize realtime tables, while this task creates optimized segments in offline
tables. This allows maintaining efficiency within realtime tables (if needed)
while continuously updating offline tables with deduplicated historical data
### Backward Compatibility:
This implementation maintains full backward compatibility for existing
non-upsert tables.
The upsert-specific logic (validDocIds validation and
CompactedPinotSegmentRecordReader filtering) is only applied when the table has
upsert configuration enabled.
For non-upsert tables, the RealtimeToOfflineSegmentsTask continues to
operate with the exact same behavior as before, ensuring no breaking changes to
existing workflows.
### Related Issues:
- Related to #12261 discussion on hybrid table upsert configurations
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]