Hi everyone,

Here are my notes from the discussion. These are based mainly on my memory, so feel free to correct or expand anything you think can be improved. Thanks!
*Agenda*
- Cadence for syncs - every 2-4 weeks?
- 0.8.0 Java release
- Community building
- Flink source and sink status
- MR formats and Hive support status
- Security (authorization, data values in metadata)
- Row-level deletes (main discussion)

*Discussion*
- Sync cadence
  - Ryan: with syncs alternating time zones, 4 weeks is too long, but 2 weeks is a lot for those of us attending all of them. How about 3 weeks?
  - Consensus was every 3 weeks
- 0.8.0 Java release
  - When should we target the release? Consensus was mid-April (3 weeks)
  - What do we want in the release? The main outstanding features are ORC support, Parquet vectorized reads, and Spark/Hive changes
  - Ideally the release will include ORC support, since it is close
  - Hive version is 2.3 and should not block Hive work
  - Vectorized reads are nice-to-have but should not block a release
  - Can we disable consistent versions for Spark 2.4 and Spark 3.0 support in the same repo? Ryan will dig up a build script with a baseline applied to only some modules; maybe we can disable it
- Community building
  - Saisai suggested a Powered By page where we can post who is using Iceberg in production. Great idea!
  - Openinx suggested a blog section on the docs site
  - Ryan has concerns about blogs in docs - why not link to blogs on other platforms? We don't want content to get stale or have the community "reviewing" content.
  - Owen: some blogs break links
- Flink source and sink status
  - The Tencent data lake team posted a sink based on the Netflix skunkworks project, but it needs Netflix-specific features/dependencies removed
  - Issues have been opened for the work to get the sink in
  - Ryan: we'll need reviewers because I'm not qualified. Will reach out to Steven Wu (Netflix sink author) and other people interested in Flink.
  - Ryan: the Spark source is coming along, but the hardest part is getting a stream of files to process from table state. Is that something we want to share between the Spark and Flink implementations?
- Probably want to share, if possible
- Skipped MR/Hive status and security (will start a dev list thread) to get to row-level deletes
- Row-level deletes roadmap
  - Ryan will be working on this more, with a doc for Spark MERGE INTO interfaces coming soon
  - This has been moving slowly because some parts, like sequence numbers, require forward-breaking/v2 changes
  - Owen suggested building two parallel write paths so we can keep writing v1. Everyone agreed with this
  - Several projects can be done by anyone and do not require forward-breaking/v2 changes: delete file format readers and writers, record iterator implementations to merge deletes (set-based, merge-based), and specs for these once they are built
  - Junjie offered to work on file/position delete files
  - Equality delete merges are blocked on adding sort order to the format
  - The main blocking decision point is how to track delete files in manifests; Ryan will start a dev list thread
  - Openinx raised concerns about minimizing end-to-end latency for a use case with a high write volume of equality deletes
    - Ryan's response was that this will likely require offline optimization: write equality deletes from Flink but rewrite them in a more efficient form (sorted, translated to file/position, etc.) in a separate service. Enabling these services is the role of Iceberg, which is an at-rest format. Other approaches put this complexity into the writer, but it has to be done somewhere.
  - Gautam: what about GDPR deletes?
    - Ryan: GDPR deletes are a simpler case, where volume is much lower. That brings us back to the roadmap: let's focus on the simpler end-to-end use cases and get those done, then work on scaling them. The first steps are to define and document the formats, to build a set-based delete filter implementation for equality deletes and a merge-based one for file/position deletes, and to add sequence numbers.
- Thanks to everyone who attended! Will schedule the next sync for 3 weeks from now.
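As a footnote on the two merge strategies mentioned in the roadmap, here is a rough sketch of the difference between a set-based filter for equality deletes and a merge-based filter for file/position deletes. This is only an illustration in Python, not Iceberg's API; all names and record shapes are hypothetical.

```python
# Hypothetical sketch of the two delete-merge strategies; not Iceberg code.

def equality_delete_filter(rows, deletes, key_columns):
    """Set-based merge: load the deleted key tuples into a set,
    then drop any data row whose key is in that set."""
    deleted_keys = {tuple(d[c] for c in key_columns) for d in deletes}
    for row in rows:
        if tuple(row[c] for c in key_columns) not in deleted_keys:
            yield row

def position_delete_filter(rows, deleted_positions):
    """Merge-based filter: positions in a delete file are ordered, so
    advance the row stream and the delete stream together, skipping
    any row whose position matches the next delete."""
    deletes = iter(sorted(deleted_positions))
    next_delete = next(deletes, None)
    for position, row in enumerate(rows):
        if position == next_delete:
            next_delete = next(deletes, None)
            continue
        yield row
```

The trade-off is the one discussed above: the set-based approach handles unsorted equality deletes but must hold the delete set in memory, while the merge-based approach streams both sides and only works because position deletes are ordered.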
--
Ryan Blue
Software Engineer
Netflix