Quick correction: these are notes for the sync that took place on *January 5th*, 9am-10am PT.
-Sam

On Fri, Jan 14, 2022 at 6:21 PM Sam Redai <s...@tabular.io> wrote:

> Hey Everyone!
>
> Here are the minutes and video recording from our Iceberg Sync that took
> place on December 5th, 9am-10am PT. A quick reminder that since the
> previous sync was pushed forward one week, we have a shorter window this
> time, and the next sync is this coming week on 01/19 at 9am PT. If you
> have any highlights or agenda items, don't forget to include them in the
> live doc
> <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web>.
>
> As always, anyone can join the discussion, so feel free to share the
> Iceberg-Sync <https://groups.google.com/g/iceberg-sync> Google group with
> anyone who is seeking an invite. The notes and the agenda are posted in
> the live doc
> <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web>
> that's also attached to the meeting invitation.
>
> Minutes:
>
> Meeting Recording ⭕
> <https://drive.google.com/file/d/1o-GQg0ER1Jco9RC1ayiXZi4tf4w1BUV_/view?ts=61d71e9c>
>
> Top of the Meeting Highlights
>
> - Lock Manager support for HadoopCatalog: The lock manager functionality
>   was recently added by Jack Ye, and Nan has added support for it in the
>   HadoopCatalog!
> - Expiration for the Cache Manager: This functionality was added by Kyle
>   and is now configurable. It handles situations where Spark would keep
>   cached table state around for a long time.
> - NOT_STARTS_WITH operator: A significant addition by Kyle that allows
>   Iceberg to handle cases where Spark negates a STARTS_WITH predicate.
> - Time travel with schema changes now fixed: Wing Yew made an update so
>   that time traveling also switches to the schema that was in use at that
>   time. (Spark 2.4 through Spark 3.2)
> - GCSFileIO: This was recently added by Dan. Also, thanks to Kyle for
>   testing it out!
> - Spark vectorized reads with equality deletes: Yufei has this added and
>   working!
> - DELETE, UPDATE, MERGE in Spark 3.2: Work is continuing on the
>   copy-on-write plans for Spark 3.2, and the 0.13.0 release will soon be
>   unblocked. Thanks Anton!
> - Rewrite data files stored procedure: Allows you to select portions of
>   the table to consider for rewrites. Recently added by Ajantha! (See the
>   P.S. below for an example call.)
>
> Upcoming 0.13.0 Release
>
> - Bugfixes are getting in very quickly.
> - Spark 3.2 support should be in by the end of the week to unblock the
>   release (waiting on MERGE support).
> - A release candidate is expected in about a week.
>
> MergeOnRead feature
>
> - Anton is currently working on this.
> - Supporting this required moving all of the plans from Spark's optimizer
>   into the analyzer. (A pretty significant change to how plans work in
>   Iceberg's SQL extensions, and a Spark 3.2-only change.)
> - DELETE FROM is in an approved PR and should be merged soon. (A sketch
>   of the row-level commands follows just below.)
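>
> If you want to kick the tires on these row-level commands once the Spark
> 3.2 work lands, here's a minimal, untested sketch. It assumes a PySpark
> session launched with the Iceberg runtime jar and SQL extensions
> configured, plus a catalog named `demo`; the `db.events` and
> `db.events_staging` tables and their columns are made up for
> illustration:
>
>     from pyspark.sql import SparkSession
>
>     # Assumes spark.sql.extensions and an Iceberg catalog named `demo`
>     # were configured when the session was launched.
>     spark = SparkSession.builder.appName("iceberg-row-level").getOrCreate()
>
>     # Row-level delete and update on an Iceberg table
>     spark.sql("DELETE FROM demo.db.events WHERE id < 1000")
>     spark.sql("UPDATE demo.db.events SET status = 'archived' WHERE status = 'stale'")
>
>     # Upsert from a staging table with MERGE INTO
>     spark.sql("""
>         MERGE INTO demo.db.events t
>         USING demo.db.events_staging s
>         ON t.id = s.id
>         WHEN MATCHED THEN UPDATE SET *
>         WHEN NOT MATCHED THEN INSERT *
>     """)
>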
> Tagging and branching
>
> - The work on adding this to the Java implementation is underway, and the
>   more eyes on it the better.
> - This is a good time to consider how we identify branches and tags in
>   SELECT queries across the various engines.
> - How should branch history be used?
>   - It would be more useful if time traveling within a branch used the
>     history of that branch instead of the current main branch.
>   - This would require an update to the spec.
> - The proposed spec defines `min-snapshots-to-keep` and
>   `max-snapshot-age-ms` as the defaults for all branches, and then allows
>   them to be overridden for particular branches.
>
> Delete read optimization
>
> - Support for vectorized readers for positional deletes and equality
>   deletes has been merged in.
> - For non-vectorized reads, some memory optimizations are pending: PR
>   #3535 <https://github.com/apache/iceberg/pull/3535>
>
> REST Catalog
>
> - There have been a lot of discussions, and the REST catalog spec is
>   coming together!
> - The OpenAPI spec for namespace operations should be ready to merge soon
>   (create namespace, drop namespace, create namespace property, etc.).
> - One of the goals here is a standardized API that enables more flexible
>   implementations without requiring users to load a runtime jar. Other
>   goals include better conflict detection and handling cases where an old
>   writer drops refs.
> - The REST catalog should also enable light wrapping of an existing
>   catalog implementation (e.g. JDBC) to expose a language-independent
>   interface over an existing catalog. (Although we most likely will not
>   include such a service as part of the open-source Iceberg project.)
>
> Parquet/ORC Bloom Filter Support
>
> - There have been discussions in the past about taking advantage of the
>   bloom filters available in the Parquet and ORC file formats.
> - This would most likely require sensible configuration at the table
>   level. (Correctly configuring the filter may in fact be the most
>   challenging part of implementing this.)
> - It's possible that additional complexity exists on the write side when
>   factoring in schema evolution.
>
> Potential Spark support for streaming change data feeds (out of Iceberg
> tables)
>
> - Flink is currently implemented using pre-update, post-update, delete,
>   and insert records, and we should be able to do the same thing in Spark
>   by adding a reader or mode that uses that as the schema.
> - An alternative to log segments is to read the previous snapshot and the
>   current snapshot and calculate the diff live. That's made much easier
>   with merge-on-read.
> - Calculating the diff live would have the challenge of determining which
>   record in the previous snapshot corresponds to an updated record in the
>   current snapshot.
>
> Have a great weekend!
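
P.S. For anyone who wants to try the rewrite data files stored procedure
mentioned in the highlights, the call looks roughly like the sketch below.
Treat it as illustrative only: the `demo` catalog and `db.events` table are
placeholders, and the argument names may differ slightly depending on the
release you're running. It reuses the `spark` session from the earlier
sketch:

    # Compact only the files matched by the filter, leaving the rest of
    # the table untouched (the catalog, table, and filter are hypothetical).
    spark.sql("""
        CALL demo.system.rewrite_data_files(
          table => 'db.events',
          where => 'id >= 1000 AND id < 2000'
        )
    """)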