How do I use a snapshot here to reprocess? Let me give more context and an example of the reprocessing needs.
Pipeline A writes processed events to Table A. Pipelines B and C read from Table A and write to Table B and Table C respectively. We want Table A, Table B, and Table C to be efficiently queryable by event_time, so partitioning them by event_time makes sense. The issue is that many events arrive very late, up to 6 months old. We still process them and write them to the correct event_time-based partitions in all tables.

The caveat is the need to reprocess any arbitrary partition's data within this last 6 months. For example, we need to fix the data in Table B and Table C between 08-30-2023 and 09-07-2023 because we know some bug happened after 08-30. But since we allow late-data processing, during that time frame we could have processed 03-2023 data as well, so we need to reprocess that too. So we need to collect all the affected event_time partitions and re-ingest them from Table A for reprocessing.

Now, by using a "snapshot ID as my place in time", do you mean I can derive the event_time partitions that were affected (added, updated) based on wall-clock time? How exactly do I do that? I know there are the history, manifests, and snapshots metadata tables, but I haven't dug into them to see whether it's possible to derive which partitions were added or updated within a wall-clock time range.

On Fri, Sep 8, 2023 at 8:58 AM Ryan Blue <b...@tabular.io> wrote:

> What we normally do is consume the data incrementally, using the snapshot
> ID as your place in time. If you are reprocessing after snapshots have
> expired, then you can usually do a good job by using a window of event time
> with an ingestion time filter. I'd definitely store just one copy if
> possible, and most people prefer to use event time so that users (who are
> less sophisticated) don't have trouble querying the table.
>
> On Fri, Sep 8, 2023 at 8:30 AM Nirav Patel <nira...@gmail.com> wrote:
>
>> I am using spark-streaming to ingest live event streams every 5 minutes
>> and append them into an Iceberg table. This table can be ingested and
>> processed by a downstream data pipeline, or it can be used directly by the
>> end consumer to do data analysis.
>>
>> Every event-based datapoint has a publish_time and an event_time, and at
>> some point it also has an ingestion_time and a store_time.
>>
>> I can see that event_time will be most useful for querying, as there can
>> be a lot of late-arriving data and event_time will give the updated value
>> every time it is queried. I can partition the table on event_time for this
>> use case. However, another use case is data reprocessing in the event of
>> any failure or data inconsistency. For reprocessing, it makes more sense
>> to work from store_time or ingestion_time, but if the data is only
>> partitioned by event_time, then querying based on ingestion_time will scan
>> the entire table.
>>
>> Solutions I have thought of:
>>
>> 1) Create two sets of tables for each schema: one partitioned by
>> event_time and another by ingestion_time. However, this is harder to
>> maintain.
>>
>> 1.1) Instead of two full copies of the data, just have reverse indexing,
>> i.e. partition by ingestion_time and store only event_time as a value.
>>
>> 2) Partition by ingestion_time, and during querying just use a big enough
>> time range to accommodate late data. However, that is not a
>> consumer-friendly query.
>>
>>
>> Is there any better suggestion?
>>
>>
>> Best,
>>
>> Nirav
>>
>
>
> --
> Ryan Blue
> Tabular
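
For the question above about deriving affected partitions from the metadata tables, here is a minimal sketch of one way it could work with Spark SQL over Iceberg's `snapshots` and `all_entries` metadata tables. The catalog/table name `catalog.db.table_a` and the exact window bounds are placeholders for the thread's example, not anything from the original tables.

```python
# Sketch: find event_time partitions touched by commits inside a wall-clock
# window, using Iceberg metadata tables. `catalog.db.table_a` and the window
# bounds are placeholders; the metadata tables (`snapshots`, `all_entries`)
# and the entries status codes (0=EXISTING, 1=ADDED, 2=DELETED) are Iceberg's.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("affected-partitions").getOrCreate()

# Data files added by any snapshot committed inside the wall-clock window,
# reduced to their partition values. Late data written during the window
# shows up here with its original event_time partition (e.g. 03-2023).
affected = spark.sql("""
    SELECT DISTINCT e.data_file.partition AS event_time_partition
    FROM catalog.db.table_a.all_entries e
    JOIN catalog.db.table_a.snapshots s
      ON e.snapshot_id = s.snapshot_id
    WHERE e.status = 1                                   -- 1 = ADDED
      AND s.committed_at >= TIMESTAMP '2023-08-30 00:00:00'
      AND s.committed_at <  TIMESTAMP '2023-09-08 00:00:00'
""")

affected.show(truncate=False)
```

Note that this only works while the snapshots from that window are still retained; once they are expired, you are back to the fallback Ryan mentions of an event-time window plus an ingestion-time filter.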
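
And a sketch of the incremental-read approach Ryan describes ("snapshot ID as your place in time"), again with the same placeholder table name. Spark's Iceberg source exposes `start-snapshot-id` (exclusive) and `end-snapshot-id` (inclusive) read options for incremental scans of append snapshots:

```python
# Sketch: read only the records appended between two snapshots that bracket
# the wall-clock window, then derive the distinct event_time days to
# re-ingest. Assumes the in-window snapshots are appends and not yet expired.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental-reprocess").getOrCreate()

rows = spark.sql("""
    SELECT snapshot_id, parent_id, committed_at
    FROM catalog.db.table_a.snapshots
    WHERE committed_at >= TIMESTAMP '2023-08-30 00:00:00'
      AND committed_at <  TIMESTAMP '2023-09-08 00:00:00'
    ORDER BY committed_at
""").collect()

start_id = rows[0].parent_id    # start-snapshot-id is exclusive, so use the
                                # parent of the first in-window snapshot
end_id = rows[-1].snapshot_id   # end-snapshot-id is inclusive

changed = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", str(start_id))
    .option("end-snapshot-id", str(end_id))
    .load("catalog.db.table_a")
)

# The distinct event_time days in `changed` are the partitions that need to be
# re-ingested from Table A into Table B and Table C.
changed.select(F.to_date("event_time").alias("event_day")).distinct().show()
```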