For 3, it may not be worth adding the extra complexity of introducing a "change set" unless we get solid data showing that writing a "change set" is actually faster than a complete rewrite.
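To make that trade-off concrete, here is a minimal Python sketch of the two approaches. The list-of-filenames "manifest list" and the function names are stand-ins for illustration only, not Iceberg's actual metadata layout: a full rewrite serializes every entry again, while a change set records only the delta and shifts reconstruction work to the reader.

```python
# Illustrative only: contrasts a complete metadata rewrite with writing a
# "change set" (delta). A real manifest list is an Avro file, not a Python list.

def full_rewrite(manifest_list, added, deleted):
    """Produce the next snapshot by rewriting the entire manifest list."""
    next_list = [m for m in manifest_list if m not in deleted]
    next_list.extend(added)
    return next_list  # every surviving entry gets serialized again

def change_set(added, deleted):
    """Record only the delta; smaller write, but readers must fold it in."""
    return {"added": list(added), "deleted": list(deleted)}

def apply_change_sets(base, deltas):
    """Reader side: reconstruct the current list from base + ordered deltas."""
    current = list(base)
    for d in deltas:
        current = [m for m in current if m not in d["deleted"]]
        current.extend(d["added"])
    return current

base = ["manifest-1.avro", "manifest-2.avro"]
delta = change_set(added=["manifest-3.avro"], deleted=["manifest-1.avro"])
# Both paths must converge on the same state; the question is write cost
# per commit versus read-side reconstruction cost.
assert apply_change_sets(base, [delta]) == full_rewrite(
    base, added=["manifest-3.avro"], deleted=["manifest-1.avro"]
)
```

The benchmark question above is exactly whether the smaller per-commit write in `change_set` outweighs the extra reader work in `apply_change_sets`.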
Best,
Yufei

`This is not a contribution`

On Tue, Jun 22, 2021 at 12:43 PM Sreeram Garlapati <[email protected]> wrote:

> Hello Iceberg devs!
>
> Did any of you solve "low latency writes to Iceberg"? Overall, it boiled
> down to 2 questions:
> 1. Is there a way to add indexes to an Iceberg table to support
>    equality-based filters (pl. see point #2 below for more explanation)?
> 2. Is there a workstream to support writing deltas of metadata changes
>    (pl. see point #3 below)?
>
> Best,
> Sreeram
>
> https://github.com/apache/iceberg/issues/2723
>
> *Truly appreciate any inputs.*
>
> Supporting low latency writes to an Iceberg table entails the below
> sub-problems:
>
> 1. Optimizing the data payload: optimizing the data payload to be
>    written to the table.
>    - Write pattern: there is an optimization that is specific to the
>      write pattern.
>      1. Appends: in the case of appends, this is already solved, as
>         Iceberg always writes the new inserts as new files.
>      2. Deletes (/upserts): in the case of deletes (or upserts, which
>         are broken down into insert + delete in v2), this problem is
>         solved as well.
>    - File format: there is another optimization knob useful at the file
>      format level. It might not make sense to generate data in a
>      columnar format here - all the time and space spent on encoding,
>      storing stats, etc. (assuming the writes are a small number of
>      rows, e.g. < 5) can be saved. So, thankfully, for these
>      low-latency writes to an Iceberg table, the AVRO file format can
>      be used.
> 2. Locating the records that need to be updated with low latency: in
>    the case of upserts, locating the records that need to be updated is
>    the key problem to be solved.
>    - One popular solution for this is to maintain indexes to support
>      the equality filters used for upserts. Do you know if there is any
>      ongoing effort for this!?
> 3. Optimizing the metadata payload: for every write to an Iceberg
>    table, the schema file & manifest list file are rewritten. To
>    further push the payload down, we could potentially write the
>    "change set" here. Is this the current direction of thought? *If so,
>    pointers to any work stream in this regard are truly appreciated.*
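Sreeram's point #2 (equality-based lookup for upserts) can be sketched as follows. `EqualityIndex` and `upsert` here are hypothetical names for illustration only, not an existing Iceberg API; the sketch just shows why an equality index turns "find the row to delete" from a table scan into a map lookup, with the upsert modeled as delete-then-insert per the v2 decomposition mentioned above.

```python
# Hedged sketch: an in-memory equality index mapping a key value to the data
# file that currently holds the live row for that key. Hypothetical names;
# not a real Iceberg or PyIceberg API.

class EqualityIndex:
    def __init__(self):
        self._index = {}  # key value -> data file containing the live row

    def insert(self, key, data_file):
        self._index[key] = data_file

    def lookup(self, key):
        return self._index.get(key)

def upsert(index, key, new_file):
    """Model an upsert as delete + insert, using the index to locate the row."""
    old_file = index.lookup(key)          # O(1) equality lookup, no file scan
    deletes = [(key, old_file)] if old_file else []
    index.insert(key, new_file)           # the new row lands in a fresh file
    return deletes                        # would become equality-delete entries

idx = EqualityIndex()
idx.insert("user-42", "data-001.avro")
# Upserting the same key records one delete against the old file and
# repoints the index at the new file.
assert upsert(idx, "user-42", "data-002.avro") == [("user-42", "data-001.avro")]
assert idx.lookup("user-42") == "data-002.avro"
```

The open question in the thread is where such an index would live and how it stays consistent with snapshots; the sketch only shows the lookup it would have to support.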
