I am using Spark Streaming to ingest live event streams every 5 minutes and append them to an Iceberg table. This table can be ingested and processed by a downstream data pipeline, or it can be used directly by end consumers for data analysis.
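For context, the ingestion job looks roughly like this (the Kafka source, catalog/table names and checkpoint path are illustrative, not my actual setup):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("event-ingest").getOrCreate()

# Hypothetical streaming source; parsing the Kafka value into the event
# schema (publish_time, event_time, ...) is omitted here.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# Append into the Iceberg table in 5-minute micro-batches.
query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    .trigger(processingTime="5 minutes")
    .option("checkpointLocation", "s3://bucket/checkpoints/events")
    .toTable("db.events")
)
query.awaitTermination()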
Every event-based datapoint has a publish_time and an event_time, and at some point it also gets an ingestion_time and a store_time. Event_time will be the most useful for querying: there can be a lot of late-arriving data, and querying on event_time gives the updated value every time the table is queried, so I could partition the table on event_time for this use case. However, another use case is data reprocessing in the event of a failure or data inconsistency. For reprocessing, it makes more sense to go by store_time or ingestion_time, but if the data is only partitioned by event_time, then a query based on ingestion_time will scan the entire table.

Solutions I have thought of:

1) Create two tables for each schema, one partitioned by event_time and the other by ingestion_time. However, this is harder to maintain.

1.1) Instead of two full copies of the data, keep a reverse index: a second table partitioned by ingestion_time that stores only event_time as a value.

2) Partition by ingestion_time, and when querying by event_time add a big enough ingestion_time range to accommodate late data (see the sketch below). However, that is not a consumer-friendly query.
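To make option 2 concrete, here is a rough sketch of the table and the consumer query I have in mind (names, the daily partition granularity and the 7-day late-data padding are just illustrative):

# Table partitioned by day of ingestion_time (same Spark session as above).
spark.sql("""
    CREATE TABLE IF NOT EXISTS db.events (
        event_id        STRING,
        payload         STRING,
        publish_time    TIMESTAMP,
        event_time      TIMESTAMP,
        ingestion_time  TIMESTAMP,
        store_time      TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(ingestion_time))
""")

# Consumer query for a single event_time day: the extra ingestion_time
# range (padded by 7 days for late data) is what allows partition pruning,
# but it is exactly the part that is not consumer friendly.
df = spark.sql("""
    SELECT *
    FROM db.events
    WHERE event_time >= TIMESTAMP '2024-06-01 00:00:00'
      AND event_time <  TIMESTAMP '2024-06-02 00:00:00'
      AND ingestion_time >= TIMESTAMP '2024-06-01 00:00:00'
      AND ingestion_time <  TIMESTAMP '2024-06-09 00:00:00'
""")

Is there any better suggestion?

Best,
Nirav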