I am using Spark Streaming to ingest live event streams every 5 minutes and
append them into an Iceberg table. This table can be consumed and processed
by a downstream data pipeline, or it can be used directly by end consumers
for data analysis.
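
Roughly, the ingestion looks like this (source, bucket, catalog, and table
names are simplified placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("event-ingest").getOrCreate()

    # Read the live event stream (Kafka is just an example source here).
    raw = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        .load()
    )

    # ... parse raw.value into typed event columns (publish_time, event_time, ...) ...
    events = raw  # placeholder for the parsed stream

    # Append a micro-batch to the Iceberg table every 5 minutes.
    query = (
        events.writeStream
        .format("iceberg")
        .outputMode("append")
        .trigger(processingTime="5 minutes")
        .option("checkpointLocation", "s3://my-bucket/checkpoints/events")
        .toTable("catalog.db.events")
    )
    query.awaitTermination()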

Every event-based data point has a publish_time and an event_time, and at
some point it also gets an ingestion_time and a store_time.

I can see that event_time will be the most useful for querying, since there
can be a lot of late-arriving data, and filtering on event_time gives the
up-to-date picture every time the table is queried. I can partition the
table on event_time for this use case. However, another use case is data
reprocessing in the event of a failure or data inconsistency. For
reprocessing, it makes more sense to filter on store_time or ingestion_time,
but if the data is only partitioned by event_time, then querying based on
ingestion_time will scan the entire table.
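
For illustration, the layout I have in mind and the two query shapes look
roughly like this (the schema and dates are only examples):

    # Table partitioned for the analytics use case (event_time).
    spark.sql("""
        CREATE TABLE catalog.db.events (
            event_id        string,
            payload         string,
            publish_time    timestamp,
            event_time      timestamp,
            ingestion_time  timestamp,
            store_time      timestamp
        )
        USING iceberg
        PARTITIONED BY (days(event_time))
    """)

    # Analytics queries prune partitions via event_time:
    spark.sql("""
        SELECT * FROM catalog.db.events
        WHERE event_time BETWEEN '2023-01-01' AND '2023-01-02'
    """).show()

    # But a reprocessing query keyed on ingestion_time gets no pruning
    # and scans every partition:
    spark.sql("""
        SELECT * FROM catalog.db.events
        WHERE ingestion_time BETWEEN '2023-01-01' AND '2023-01-02'
    """).show()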

Solutions I have thought of are:

1) Create two copies of the table for each schema, one partitioned by
event_time and the other by ingestion_time. However, this is harder to
maintain.

1.1) Instead of two full copies of the data, keep a reverse index, i.e. a
table partitioned by ingestion_time that stores only event_time as a value.

2) Partition by ingestion_time, and at query time use a big enough time
range to accommodate late data, as sketched below. However, that is not a
consumer-friendly query.
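
For option 2, assuming the table is partitioned by days(ingestion_time), a
consumer query would look roughly like this, with the extra predicate padded
by the worst-case lateness (3 days here is just an example):

    # Pad the ingestion_time window so late events are still picked up
    # while partition pruning stays effective.
    spark.sql("""
        SELECT *
        FROM catalog.db.events
        WHERE event_time     BETWEEN '2023-01-01' AND '2023-01-02'
          AND ingestion_time BETWEEN '2023-01-01' AND '2023-01-05'  -- padded window
    """).show()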


Is there any better suggestion?


Best,

Nirav
