Quick correction: these are notes for the sync that took place on *January 5th*, 9am-10am PT.
-Sam

On Fri, Jan 14, 2022 at 6:21 PM Sam Redai <s...@tabular.io> wrote:

> Hey Everyone!
>
> Here are the minutes and video recording from our Iceberg Sync that took
> place on December 5th, 9am-10am PT. A quick reminder that since the
> previous sync was pushed forward one week, we have a shorter window this
> time, and the next sync is this coming week on 01/19 at 9am PT. If you
> have any highlights or agenda items, don't forget to include them in the
> live doc
> <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web>.
>
> As always, anyone can join the discussion, so feel free to share the
> Iceberg-Sync <https://groups.google.com/g/iceberg-sync> Google group with
> anyone who is seeking an invite. The notes and the agenda are posted in
> the live doc
> <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web>
> that's also attached to the meeting invitation.
>
> Minutes:
>
> Meeting Recording ⭕
> <https://drive.google.com/file/d/1o-GQg0ER1Jco9RC1ayiXZi4tf4w1BUV_/view?ts=61d71e9c>
>
> Top of the Meeting Highlights
>
> - Lock Manager support for HadoopCatalog: The lock manager functionality
>   was recently added by Jack Ye, and Nan has added support for it in the
>   HadoopCatalog!
> - Expiration for the Cache Manager: This functionality was added by Kyle
>   and is now configurable. It handles situations where Spark would keep
>   cached table state around for a long time.
> - NOT_STARTS_WITH operator: A significant addition by Kyle that allows
>   Iceberg to handle cases where Spark negates a STARTS_WITH predicate.
> - Time travel with schema changes now fixed: Wing Yew made an update so
>   that time traveling also switches to the schema that was in use at that
>   time. (Spark 2.4 through Spark 3.2)
> - GCSFileIO: This was recently added by Dan. Also, thanks to Kyle for
>   testing it out!
> - Spark vectorized reads with equality deletes: Yufei has this added and
>   working!
> - DELETE, UPDATE, MERGE in Spark 3.2: Work is continuing on the
>   copy-on-write plans for Spark 3.2, and the 0.13.0 release will soon be
>   unblocked. Thanks Anton!
> - Rewrite data files stored procedure: Allows you to select portions of
>   the table to consider for rewrites. Recently added by Ajantha! (See the
>   P.S. below for an example call.)
>
> Upcoming 0.13.0 Release
>
> - Bugfixes are getting in very quickly.
> - Spark 3.2 support should be in by the end of the week to unblock the
>   release (waiting on MERGE support).
> - A release candidate is expected in about a week.
>
> MergeOnRead feature
>
> - Anton is currently working on this.
> - Supporting this required moving all of the plans from Spark's optimizer
>   into the analyzer. (A pretty significant change to how plans work in
>   Iceberg's SQL extensions, and a Spark 3.2-only change.)
> - DELETE FROM is in an approved PR and should be merged soon. (A sketch
>   of the row-level commands follows just below.)
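>
> If you want to kick the tires on these row-level commands once the Spark
> 3.2 work lands, here's a minimal, untested sketch. It assumes a PySpark
> session launched with the Iceberg runtime jar and SQL extensions
> configured, plus a catalog named `demo`; the `db.events` and
> `db.events_staging` tables and their columns are made up for
> illustration:
>
>     from pyspark.sql import SparkSession
>
>     # Assumes spark.sql.extensions and an Iceberg catalog named `demo`
>     # were configured when the session was launched.
>     spark = SparkSession.builder.appName("iceberg-row-level").getOrCreate()
>
>     # Row-level delete and update on an Iceberg table
>     spark.sql("DELETE FROM demo.db.events WHERE id < 1000")
>     spark.sql("UPDATE demo.db.events SET status = 'archived' WHERE status = 'stale'")
>
>     # Upsert from a staging table with MERGE INTO
>     spark.sql("""
>         MERGE INTO demo.db.events t
>         USING demo.db.events_staging s
>         ON t.id = s.id
>         WHEN MATCHED THEN UPDATE SET *
>         WHEN NOT MATCHED THEN INSERT *
>     """)
>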
> Tagging and branching
>
> - The work on adding this to the Java implementation is underway, and the
>   more eyes on it the better.
> - This is a good time to consider how we identify branches and tags in
>   SELECT queries across the various engines.
> - How should branch history be used?
>   - It would be more useful if time traveling within a branch used the
>     history of that branch instead of the current main branch.
>   - This would require an update to the spec.
> - The proposed spec defines `min-snapshots-to-keep` and
>   `max-snapshot-age-ms` as the defaults for all branches, and then allows
>   them to be overridden for particular branches.
>
> Delete read optimization
>
> - Support for vectorized readers for positional deletes and equality
>   deletes has been merged in.
> - For non-vectorized reads, some memory optimizations are pending: PR
>   #3535 <https://github.com/apache/iceberg/pull/3535>
>
> REST Catalog
>
> - There have been a lot of discussions, and the REST catalog spec is
>   coming together!
> - The OpenAPI spec for namespace operations should be ready to merge soon
>   (create namespace, drop namespace, create namespace property, etc.).
> - One of the goals here is a standardized API that enables more flexible
>   implementations without requiring users to load a runtime jar. Other
>   goals include better conflict detection and handling cases where an old
>   writer drops refs.
> - The REST catalog should also enable light wrapping of an existing
>   catalog implementation (e.g. JDBC) to expose a language-independent
>   interface over an existing catalog. (Although we most likely will not
>   include such a service as part of the open-source Iceberg project.)
>
> Parquet/ORC Bloom Filter Support
>
> - There have been discussions in the past about taking advantage of the
>   bloom filters available in the Parquet and ORC file formats.
> - This would most likely require sensible configuration at the table
>   level. (Correctly configuring the filter may in fact be the most
>   challenging part of implementing this.)
> - It's possible that additional complexity exists on the write side when
>   factoring in schema evolution.
>
> Potential Spark support for streaming change data feeds (out of Iceberg
> tables)
>
> - Flink is currently implemented using pre-update, post-update, delete,
>   and insert records, and we should be able to do the same thing in Spark
>   by adding a reader or mode that uses that as the schema.
> - An alternative to log segments is to read the previous snapshot and the
>   current snapshot and calculate the diff live. That's made much easier
>   with merge-on-read.
> - Calculating the diff live would have the challenge of determining which
>   record in the previous snapshot corresponds to an updated record in the
>   current snapshot.
>
> Have a great weekend!
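
P.S. For anyone who wants to try the rewrite data files stored procedure
mentioned in the highlights, the call looks roughly like the sketch below.
Treat it as illustrative only: the `demo` catalog and `db.events` table are
placeholders, and the argument names may differ slightly depending on the
release you're running. It reuses the `spark` session from the earlier
sketch:

    # Compact only the files matched by the filter, leaving the rest of
    # the table untouched (the catalog, table, and filter are hypothetical).
    spark.sql("""
        CALL demo.system.rewrite_data_files(
          table => 'db.events',
          where => 'id >= 1000 AND id < 2000'
        )
    """)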