For 3, it may not be worth adding the extra complexity of introducing a "change set" unless we get solid data showing that writing a "change set" is actually faster than a complete rewrite.
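To make that trade-off concrete, here is a minimal Python sketch of the two approaches. The list-of-filenames "manifest list" and the function names are stand-ins for illustration only, not Iceberg's actual metadata layout: a full rewrite serializes every entry again, while a change set records only the delta and shifts reconstruction work to the reader.

```python
# Illustrative only: contrasts a complete metadata rewrite with writing a
# "change set" (delta). A real manifest list is an Avro file, not a Python list.

def full_rewrite(manifest_list, added, deleted):
    """Produce the next snapshot by rewriting the entire manifest list."""
    next_list = [m for m in manifest_list if m not in deleted]
    next_list.extend(added)
    return next_list  # every surviving entry gets serialized again

def change_set(added, deleted):
    """Record only the delta; smaller write, but readers must fold it in."""
    return {"added": list(added), "deleted": list(deleted)}

def apply_change_sets(base, deltas):
    """Reader side: reconstruct the current list from base + ordered deltas."""
    current = list(base)
    for d in deltas:
        current = [m for m in current if m not in d["deleted"]]
        current.extend(d["added"])
    return current

base = ["manifest-1.avro", "manifest-2.avro"]
delta = change_set(added=["manifest-3.avro"], deleted=["manifest-1.avro"])
# Both paths must converge on the same state; the question is write cost
# per commit versus read-side reconstruction cost.
assert apply_change_sets(base, [delta]) == full_rewrite(
    base, added=["manifest-3.avro"], deleted=["manifest-1.avro"]
)
```

The benchmark question above is exactly whether the smaller per-commit write in `change_set` outweighs the extra reader work in `apply_change_sets`.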
Best,
Yufei

`This is not a contribution`

On Tue, Jun 22, 2021 at 12:43 PM Sreeram Garlapati <[email protected]> wrote:

> Hello Iceberg devs!
>
> Did any of you solve "low latency writes to Iceberg"? Overall, it boiled
> down to 2 questions:
> 1. Is there a way to add indexes to an Iceberg table to support
>    equality-based filters (pl. see point #2 below for more explanation)?
> 2. Is there a workstream to support writing deltas of metadata changes
>    (pl. see point #3 below)?
>
> Best,
> Sreeram
>
> https://github.com/apache/iceberg/issues/2723
>
> *Truly appreciate any inputs.*
>
> Supporting low latency writes to an Iceberg table entails the below
> sub-problems:
>
> 1. Optimizing the data payload: optimizing the data payload to be
>    written to the table.
>    - Write pattern: there is an optimization that is specific to the
>      write pattern.
>      1. Appends: in the case of appends, this is already solved, as
>         Iceberg always writes the new inserts as new files.
>      2. Deletes (/upserts): in the case of deletes (or upserts, which
>         are broken down into insert + delete in v2), this problem is
>         solved as well.
>    - File format: there is another optimization knob useful at the file
>      format level. It might not make sense to generate data in a
>      columnar format here - all the time and space spent on encoding,
>      storing stats, etc. (assuming the writes are a small number of
>      rows, e.g. < 5) can be saved. So, thankfully, for these
>      low-latency writes to an Iceberg table, the AVRO file format can
>      be used.
> 2. Locating the records that need to be updated with low latency: in
>    the case of upserts, locating the records that need to be updated is
>    the key problem to be solved.
>    - One popular solution for this is to maintain indexes to support
>      the equality filters used for upserts. Do you know if there is any
>      ongoing effort for this!?
> 3. Optimizing the metadata payload: for every write to an Iceberg
>    table, the schema file & manifest list file are rewritten. To
>    further push the payload down, we could potentially write the
>    "change set" here. Is this the current direction of thought? *If so,
>    pointers to any work stream in this regard are truly appreciated.*
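Sreeram's point #2 (equality-based lookup for upserts) can be sketched as follows. `EqualityIndex` and `upsert` here are hypothetical names for illustration only, not an existing Iceberg API; the sketch just shows why an equality index turns "find the row to delete" from a table scan into a map lookup, with the upsert modeled as delete-then-insert per the v2 decomposition mentioned above.

```python
# Hedged sketch: an in-memory equality index mapping a key value to the data
# file that currently holds the live row for that key. Hypothetical names;
# not a real Iceberg or PyIceberg API.

class EqualityIndex:
    def __init__(self):
        self._index = {}  # key value -> data file containing the live row

    def insert(self, key, data_file):
        self._index[key] = data_file

    def lookup(self, key):
        return self._index.get(key)

def upsert(index, key, new_file):
    """Model an upsert as delete + insert, using the index to locate the row."""
    old_file = index.lookup(key)          # O(1) equality lookup, no file scan
    deletes = [(key, old_file)] if old_file else []
    index.insert(key, new_file)           # the new row lands in a fresh file
    return deletes                        # would become equality-delete entries

idx = EqualityIndex()
idx.insert("user-42", "data-001.avro")
# Upserting the same key records one delete against the old file and
# repoints the index at the new file.
assert upsert(idx, "user-42", "data-002.avro") == [("user-42", "data-001.avro")]
assert idx.lookup("user-42") == "data-002.avro"
```

The open question in the thread is where such an index would live and how it stays consistent with snapshots; the sketch only shows the lookup it would have to support.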
