What we normally do is consume the data incrementally, using the snapshot ID as your place in time. If you are reprocessing after snapshots have expired, then you can usually do a good job by using a window of event time with an ingestion time filter. I'd definitely store just one copy if possible, and most people prefer event time so that less sophisticated users don't have trouble querying the table.
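
Here's a rough sketch of what I mean in PySpark (the table name, column names, and snapshot IDs below are just placeholders, not anything specific to your setup):

# Rough sketch only; adjust the table, columns, and IDs for your setup.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Normal consumption: read only the data appended between the last snapshot
# you processed and the current one, using the snapshot ID as your place in time.
last_processed_snapshot_id = 1111111111111111111   # saved by the consumer
latest_snapshot_id = 2222222222222222222           # from the table's snapshots metadata

incremental = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", last_processed_snapshot_id)  # exclusive
    .option("end-snapshot-id", latest_snapshot_id)            # inclusive
    .load("db.events")
)

# Reprocessing after snapshots have expired: take an event-time window wide
# enough to cover late data (this still prunes event_time partitions) and
# narrow it with an ingestion-time filter for the period being reprocessed.
reprocessed = (
    spark.table("db.events")
    .where(F.col("event_time") >= "2023-09-01")
    .where(F.col("ingestion_time").between("2023-09-07", "2023-09-08"))
)
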
On Fri, Sep 8, 2023 at 8:30 AM Nirav Patel <nira...@gmail.com> wrote:

> I am using spark-streaming to ingest live event streams every 5 minutes and
> append them into an Iceberg table. This table can be consumed and processed
> by a downstream data pipeline, or it can be used directly by the end consumer
> for data analysis.
>
> Every event-based datapoint has publish_time and event_time, and at some
> point it also has ingestion_time and store_time.
>
> I can see that event_time will be most useful for querying, as there can be a
> lot of late-arriving data and event_time will give the updated value every
> time it's queried. I can partition the table on event_time for this use case.
> However, another use case is data reprocessing in the event of any failure or
> data inconsistency. For reprocessing, it makes more sense to reprocess on
> store_time or ingestion_time, but if data is only partitioned by event_time,
> then querying based on ingestion_time will scan the entire table.
>
> Solutions I thought of are:
>
> 1) I can create two sets of tables for each schema: one partitioned by
> event_time and another by ingestion_time. However, it's harder to maintain.
>
> 1.1) Instead of two entire copies of the data, just have reverse indexing,
> i.e. partition by ingestion_time and store only event_time as a value.
>
> 2) I can partition by ingestion_time and, during querying, use a big enough
> time range to accommodate late data. However, it's not a consumer-friendly
> query.
>
> Is there any better suggestion?
>
> Best,
>
> Nirav

--
Ryan Blue
Tabular