Hi everyone,

Here are my notes from the discussion. These are based mainly on my memory, so feel free to correct or expand anything you think can be improved. Thanks!
*Agenda*
- Cadence for syncs - every 2-4 weeks?
- 0.8.0 Java release
- Community building
- Flink source and sink status
- MR formats and Hive support status
- Security (authorization, data values in metadata)
- Row-level deletes (main discussion)

*Discussion*
- Sync cadence
  - Ryan: with syncs alternating time zones, 4 weeks is too long, but 2 weeks is a lot for those of us attending all of them. How about 3 weeks?
  - Consensus was every 3 weeks
- 0.8.0 Java release
  - When should we target the release? Consensus was mid-April (3 weeks)
  - What do we want in the release? The main outstanding features are ORC support, Parquet vectorized reads, and Spark/Hive changes
  - Ideally the release will include ORC support, since it is close
  - Hive version is 2.3 and should not block Hive work
  - Vectorized reads are nice-to-have but should not block a release
  - Can we disable consistent versions for Spark 2.4 and Spark 3.0 support in the same repo? Ryan will dig up a build script with a baseline applied to only some modules; maybe we can disable it
- Community building
  - Saisai suggested a Powered By page where we can post who is using Iceberg in production. Great idea!
  - Openinx suggested a blog section on the docs site
  - Ryan has concerns about blogs in docs - why not link to blogs on other platforms? We don't want content to get stale or have the community "reviewing" content.
  - Owen: some blogs break links
- Flink source and sink status
  - The Tencent data lake team posted a sink based on the Netflix skunkworks project, but it needs Netflix-specific features/dependencies removed
  - Issues have been opened for the work to get the sink in
  - Ryan: we'll need reviewers because I'm not qualified. Will reach out to Steven Wu (Netflix sink author) and other people interested in Flink.
  - Ryan: the Spark source is coming along, but the hardest part is getting a stream of files to process from table state. Is that something we want to share between the Spark and Flink implementations?
- Probably want to share, if possible
- Skipped MR/Hive status and security (will start a dev list thread) to get to row-level deletes
- Row-level deletes roadmap
  - Ryan will be working on this more, with a doc for Spark MERGE INTO interfaces coming soon
  - This has been moving slowly because some parts, like sequence numbers, require forward-breaking/v2 changes
  - Owen suggested building two parallel write paths so we can keep writing v1. Everyone agreed with this
  - Several projects can be done by anyone and do not require forward-breaking/v2 changes: delete file format readers and writers, record iterator implementations to merge deletes (set-based, merge-based), and specs for these once they are built
  - Junjie offered to work on file/position delete files
  - Equality delete merges are blocked on adding sort order to the format
  - The main blocking decision point is how to track delete files in manifests; Ryan will start a dev list thread
  - Openinx raised concerns about minimizing end-to-end latency for a use case with a high write volume of equality deletes
    - Ryan's response was that this will likely require offline optimization: write equality deletes from Flink but rewrite them in a more efficient form (sorted, translated to file/position, etc.) in a separate service. Enabling these services is the role of Iceberg, which is an at-rest format. Other approaches put this complexity into the writer, but it has to be done somewhere.
  - Gautam: what about GDPR deletes?
    - Ryan: GDPR deletes are a simpler case, where volume is much lower. That brings us back to the roadmap: let's focus on the simpler end-to-end use cases and get those done, then work on scaling them. The first steps are to define and document the formats, to build a set-based delete filter implementation for equality deletes and a merge-based one for file/position deletes, and to add sequence numbers.
- Thanks to everyone who attended! Will schedule the next sync for 3 weeks from now.
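As a footnote on the two merge strategies mentioned in the roadmap, here is a rough sketch of the difference between a set-based filter for equality deletes and a merge-based filter for file/position deletes. This is only an illustration in Python, not Iceberg's API; all names and record shapes are hypothetical.

```python
# Hypothetical sketch of the two delete-merge strategies; not Iceberg code.

def equality_delete_filter(rows, deletes, key_columns):
    """Set-based merge: load the deleted key tuples into a set,
    then drop any data row whose key is in that set."""
    deleted_keys = {tuple(d[c] for c in key_columns) for d in deletes}
    for row in rows:
        if tuple(row[c] for c in key_columns) not in deleted_keys:
            yield row

def position_delete_filter(rows, deleted_positions):
    """Merge-based filter: positions in a delete file are ordered, so
    advance the row stream and the delete stream together, skipping
    any row whose position matches the next delete."""
    deletes = iter(sorted(deleted_positions))
    next_delete = next(deletes, None)
    for position, row in enumerate(rows):
        if position == next_delete:
            next_delete = next(deletes, None)
            continue
        yield row
```

The trade-off is the one discussed above: the set-based approach handles unsorted equality deletes but must hold the delete set in memory, while the merge-based approach streams both sides and only works because position deletes are ordered.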
--
Ryan Blue
Software Engineer
Netflix