What we normally do is consume the data incrementally, using the snapshot ID as your place in time. If you are reprocessing after snapshots have expired, then you can usually do a good job by using a window of event time with an ingestion time filter. I'd definitely store just one copy if possible, and most people prefer event time so that less sophisticated users don't have trouble querying the table.
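
Here's a rough sketch of what I mean in PySpark (the table name, column names, and snapshot IDs below are just placeholders, not anything specific to your setup):

# Rough sketch only; adjust the table, columns, and IDs for your setup.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Normal consumption: read only the data appended between the last snapshot
# you processed and the current one, using the snapshot ID as your place in time.
last_processed_snapshot_id = 1111111111111111111   # saved by the consumer
latest_snapshot_id = 2222222222222222222           # from the table's snapshots metadata

incremental = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", last_processed_snapshot_id)  # exclusive
    .option("end-snapshot-id", latest_snapshot_id)            # inclusive
    .load("db.events")
)

# Reprocessing after snapshots have expired: take an event-time window wide
# enough to cover late data (this still prunes event_time partitions) and
# narrow it with an ingestion-time filter for the period being reprocessed.
reprocessed = (
    spark.table("db.events")
    .where(F.col("event_time") >= "2023-09-01")
    .where(F.col("ingestion_time").between("2023-09-07", "2023-09-08"))
)
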
On Fri, Sep 8, 2023 at 8:30 AM Nirav Patel <nira...@gmail.com> wrote:

> I am using spark-streaming to ingest live event streams every 5 minutes and
> append them into an Iceberg table. This table can be consumed and processed
> by a downstream data pipeline, or it can be used directly by the end consumer
> for data analysis.
>
> Every event-based datapoint has publish_time and event_time, and at some
> point it also has ingestion_time and store_time.
>
> I can see that event_time will be most useful for querying, as there can be a
> lot of late-arriving data and event_time will give the updated value every
> time it's queried. I can partition the table on event_time for this use case.
> However, another use case is data reprocessing in the event of any failure or
> data inconsistency. For reprocessing, it makes more sense to reprocess on
> store_time or ingestion_time, but if data is only partitioned by event_time,
> then querying based on ingestion_time will scan the entire table.
>
> Solutions I thought of are:
>
> 1) I can create two sets of tables for each schema: one partitioned by
> event_time and another by ingestion_time. However, it's harder to maintain.
>
> 1.1) Instead of two entire copies of the data, just have reverse indexing,
> i.e. partition by ingestion_time and store only event_time as a value.
>
> 2) I can partition by ingestion_time and, during querying, use a big enough
> time range to accommodate late data. However, it's not a consumer-friendly
> query.
>
> Is there any better suggestion?
>
> Best,
>
> Nirav

--
Ryan Blue
Tabular