> The views from the Netflix branch are a great feature. Is there any plan to port them to Apache Iceberg?
I think we'd be willing to contribute the view support to the Apache project if everyone thinks it is a good idea for Iceberg to handle views. I don't want feature creep to cause the project to be difficult to maintain, but I think it does make some sense to add views since we already have so much code that could be shared for interacting with metastores.

On Fri, Apr 17, 2020 at 9:27 AM Mass Dosage <massdos...@gmail.com> wrote:

> Cool. I've raised a draft PR for the approach we discussed on the call:
>
> https://github.com/apache/incubator-iceberg/pull/935/files
>
> It's incomplete but I've put some notes explaining that, would be nice to
> know what others think of the above approach and if they have better ideas.
>
> Another approach that we did successfully was to shade and relocate Guava
> in every Iceberg subproject that used it, that way you can depend on it
> "normally" but the build file is pretty messy with shadow jar versions of
> everything etc. I can raise a WIP PR for that approach to compare if anyone
> is interested.
>
> Thanks,
>
> Adrian
>
> On Fri, 17 Apr 2020 at 15:58, RD <rdsr...@gmail.com> wrote:
>
>> Thanks for the correction, Adrian. I've filed the ticket for GitHub here:
>> https://github.com/apache/incubator-iceberg/issues/934 . There are 2
>> approaches mentioned there with pros/cons. Will be good to get the
>> community's feedback on how to proceed.
>>
>> -best,
>> R.
>>
>> On Fri, Apr 17, 2020 at 6:28 AM Mass Dosage <massdos...@gmail.com> wrote:
>>
>>> Thanks for the detailed notes Ryan. My thoughts on a few of the topics...
>>>
>>> 0.8.0 release - my general preference is to release early and release
>>> often. If features aren't ready why wait? Why not go with a 0.8.0 release
>>> now and then a 0.9.0 (or whatever) a couple of weeks later with the other
>>> features? I know with Apache projects this can sometimes be a challenge
>>> with all the ceremony around a release, getting votes etc. but I don't
>>> think that's such a problem in the incubating stage?
>>>
>>> A clarification on the InputFormats - I think the DDL Ratandeep was
>>> referring to was more like "SHOW PARTITIONS" rather than "ADD PARTITIONS"
>>> i.e. the "read" path but for statements other than "SELECT" etc. Also, to
>>> be clear the `mapreduce` InputFormat that was contributed - it sounds like
>>> that works for Pig but I don't think it will work for Hive 1 or 2 since
>>> they use the `mapred` API for InputFormats. This is what we have attempted
>>> to cover in our InputFormat. I raised a WIP PR for it yesterday at
>>> https://github.com/apache/incubator-iceberg/pull/933 and would
>>> appreciate feedback from anyone interested in it.
>>>
>>> Thanks for sharing the Avro hack for shading and relocating Guava.
>>> Should I create a ticket on GitHub to capture this work? We'll then have a
>>> go at implementing it.
>>>
>>> Thanks,
>>>
>>> Adrian
>>>
>>> On Fri, 17 Apr 2020 at 04:07, OpenInx <open...@gmail.com> wrote:
>>>
>>>> Thanks for the write-up.
>>>> The views from the Netflix branch are a great feature. Is there any
>>>> plan to port them to Apache Iceberg?
>>>>
>>>> On Fri, Apr 17, 2020 at 5:31 AM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> Here are my notes from yesterday’s sync. As usual, feel free to add to
>>>>> this if I missed something.
>>>>>
>>>>> There were a couple of questions raised during the sync that we’d like
>>>>> to open up to anyone who wasn’t able to attend:
>>>>>
>>>>>    - Should we wait for the parallel metadata rewrite action before
>>>>>    cutting 0.8.0 candidates?
>>>>>    - Should we wait for ORC metrics before cutting 0.8.0 candidates?
>>>>>
>>>>> In the sync, we thought that it would be good to wait and get these
>>>>> in. Please reply to this if you agree or disagree.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> *Attendees*:
>>>>>
>>>>>    - Ryan Blue
>>>>>    - Dan Weeks
>>>>>    - Anjali Norwood
>>>>>    - Jun Ma
>>>>>    - Ratandeep Ratti
>>>>>    - Pavan
>>>>>    - Christine Mathiesen
>>>>>    - Gautam Kowshik
>>>>>    - Mass Dosage
>>>>>    - Filip
>>>>>    - Ryan Murray
>>>>>
>>>>> *Topics*:
>>>>>
>>>>>    - 0.8.0 release blockers: actions, ORC metrics
>>>>>    - Row-level delete update
>>>>>    - Parquet vectorized read update
>>>>>    - InputFormats and Hive support
>>>>>    - Netflix branch
>>>>>
>>>>> *Discussion*:
>>>>>
>>>>>    - 0.8.0 release
>>>>>       - Ryan: we planned to get a candidate out this week, but I
>>>>>       think we may want to wait on 2 things that are about ready
>>>>>       - First: Anton is contributing an action to rewrite manifests
>>>>>       in parallel that is close. Anyone interested? (Gautam is
>>>>>       interested)
>>>>>       - Second: ORC is passing correctness tests, but doesn’t have
>>>>>       column-level metrics. Should we wait for this?
>>>>>       - Ratandeep: ORC also lacks predicate push-down support
>>>>>       - Ryan: I think metrics are more important than PPD because PPD
>>>>>       is task side and metrics help reduce the number of tasks. If we
>>>>>       wait on one, I’d prefer to wait on metrics
>>>>>       - Ratandeep will look into whether he or Shardul can work on
>>>>>       this
>>>>>       - General consensus was to wait for these features before
>>>>>       getting a candidate out
>>>>>    - Row-level deletes
>>>>>       - Good progress in several PRs on adding the parallel v2 write
>>>>>       path, as Owen suggested last sync
>>>>>       - Junjie contributed an update to the spec for file/position
>>>>>       delete files
>>>>>    - Parquet vectorized read
>>>>>       - Dan: flat schema reads are primarily waiting on reviews
>>>>>       - Dan: is anyone interested in complex type support?
>>>>>       - Gautam needs struct and map support. 0.14.0 doesn’t support
>>>>>       maps
>>>>>       - Ryan (Murray): 0.17.0 will have lists, structs, and maps, but
>>>>>       not maps of structs
>>>>>       - Ryan (Blue): Because we have a translation layer in Iceberg
>>>>>       to pass off to Spark, we don’t actually need support in Arrow.
>>>>>       We are currently stuck on 0.14.0 because of changes that prevent
>>>>>       us from avoiding a null check (see this comment
>>>>>       <https://github.com/apache/incubator-iceberg/pull/723/files#r367667500>)
>>>>>    - InputFormat and Hive support
>>>>>       - Ratandeep: Generic (mapreduce) InputFormat is in with hooks for
>>>>>       Pig and Hive; need to start working on the serde side and
>>>>>       building a Hive StorageHandler, missing DDL support
>>>>>       - Ryan: What DDL support?
>>>>>       - Ratandeep: Statements like ADD PARTITION
>>>>>       - Ryan: How would all of this work in Hive? It isn’t clear what
>>>>>       components are needed right now: StorageHandler? RawStore?
>>>>>       HiveMetaHook?
>>>>>       - Ratandeep: Currently working on only the read path, which
>>>>>       requires a StorageHandler. The write path would be more
>>>>>       difficult.
>>>>>       - Mass Dosage: Working on a (mapred) InputFormat for Hive in
>>>>>       iceberg-mr, started working on a serde in iceberg-hive.
>>>>>       Interested in writes, but not in the short or medium term
>>>>>       - Mass Dosage: The main problem is dependency conflicts between
>>>>>       Hive and Iceberg, mainly Guava
>>>>>       - Ryan: Anyone know a good replacement for Guava collections?
>>>>>       - Ryan: In Avro, we have a module that shades Guava
>>>>>       <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/pom.xml>
>>>>>       and has a class with references
>>>>>       <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/src/main/java/org/apache/avro/GuavaClasses.java>.
>>>>>       Then shading can minimize the shaded classes. We could do that
>>>>>       here.
>>>>>       - Ryan: Is Jackson also a problem?
>>>>>       - Mass Dosage: Yes, and Calcite
>>>>>       - Ryan: Calcite probably isn’t referenced directly so we can
>>>>>       hopefully avoid the consistent versions problem by excluding it
>>>>>    - Netflix branch of Iceberg (with non-Iceberg additions)
>>>>>       - Ryan: We’ve published a copy of our current Iceberg
>>>>>       0.7.0-based branch
>>>>>       <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4> for
>>>>>       Spark 2.4 with DSv2 backported
>>>>>       <https://github.com/Netflix/spark>
>>>>>       - The purpose of this is to share non-Iceberg work that we use
>>>>>       to complement Iceberg, like views, catalogs, and DSv2 tables
>>>>>       - Views are SQL views
>>>>>       <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view>
>>>>>       that are stored and versioned like Iceberg metadata. This is how
>>>>>       we are tracking views for Presto and Spark (coral integration
>>>>>       would be nice!). We are contributing the Spark DSv2 ViewCatalog
>>>>>       to upstream Spark
>>>>>       - Metacat is an open metastore project from Netflix. The metacat
>>>>>       package contains our metastore integration
>>>>>       <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/metacat>
>>>>>       for it.
>>>>>       - The batch package contains Spark and Hive table
>>>>>       implementations for Spark’s DSv2
>>>>>       <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/batch>,
>>>>>       which we use for multi-catalog support.
>>>>>    - Gautam: how will migration to Iceberg’s v2 format work for those
>>>>>    of us in production using v1?
>>>>>       - Ryan: Tables are explicitly updated to v2 and both v1 and v2
>>>>>       will be supported in parallel. Using v1 until everything is
>>>>>       updated with v2 support takes care of forward-compatibility
>>>>>       issues. This can be done on a per-table basis
>>>>>       - Gautam: Does migration require rewriting metadata?
>>>>>       - Ryan: No, the format is backward compatible with v1, so the
>>>>>       update is metadata-only until the writers start using new
>>>>>       metadata that v1 would ignore (deletes) and would incorrectly
>>>>>       modify if it were to write to v2.
>>>>>       - Ryan: Also, Iceberg already has a forward-compatibility check
>>>>>       that will prevent v1 readers from loading a v2 table.
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>

--
Ryan Blue
Software Engineer
Netflix
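
For anyone following the v1/v2 part of the notes, here is a rough Java sketch of what a forward-compatibility check of that kind could look like. This is only an illustration: the class and method names (FormatVersionCheck, canRead, checkReadable) are hypothetical and are not taken from the Iceberg codebase. The only thing drawn from the thread is the behavior, namely that a reader built for v1 refuses to load a table that has been updated to v2, while a v2-capable reader can still load v1 tables.

```java
// Hypothetical sketch of a format-version forward-compatibility check.
// None of these names come from Iceberg; they exist only for illustration.
final class FormatVersionCheck {
    private FormatVersionCheck() {}

    /**
     * Returns true when a reader that understands format versions up to
     * readerMaxVersion may safely load a table written as tableFormatVersion.
     */
    static boolean canRead(int readerMaxVersion, int tableFormatVersion) {
        if (readerMaxVersion < 1 || tableFormatVersion < 1) {
            throw new IllegalArgumentException("Format versions start at 1");
        }
        // Older tables stay readable (backward compatibility); newer tables
        // are rejected because they may contain metadata the reader would
        // ignore or handle incorrectly, such as delete files.
        return tableFormatVersion <= readerMaxVersion;
    }

    /** Fails fast instead of silently reading a table the reader is too old for. */
    static void checkReadable(int readerMaxVersion, int tableFormatVersion) {
        if (!canRead(readerMaxVersion, tableFormatVersion)) {
            throw new UnsupportedOperationException(
                "Cannot read table with format version v" + tableFormatVersion
                    + ": this reader supports up to v" + readerMaxVersion);
        }
    }

    public static void main(String[] args) {
        checkReadable(1, 1); // v1 reader, v1 table: allowed
        checkReadable(2, 1); // v2 reader, v1 table: allowed
        try {
            checkReadable(1, 2); // v1 reader, v2 table: rejected
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing at load time, rather than at the first unknown metadata field, is the design choice that makes a per-table upgrade safe in this sketch: an old reader never gets far enough to misinterpret a table.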