Re: Iceberg community sync notes - 15 April 2020

Ryan Blue Mon, 20 Apr 2020 17:28:57 -0700

> The views from Netflix branch is a great feature, would have any plan to
> port to Apache Iceberg


I think we'd be willing to contribute the view support to the Apache
project if everyone thinks it is a good idea for Iceberg to handle views.

I don't want feature creep to cause the project to be difficult to
maintain, but I think it does make some sense to add views since we already
have so much code that could be shared for interacting with metastores.

On Fri, Apr 17, 2020 at 9:27 AM Mass Dosage <massdos...@gmail.com> wrote:

> Cool. I've raised a draft PR for the approach we discussed on the call:
>
> https://github.com/apache/incubator-iceberg/pull/935/files
>
> It's incomplete but I've put some notes explaining that, would be nice to
> know what others think of the above approach and if they have better ideas.
>
> Another approach that we did successfully was to shade and relocate Guava
> in every Iceberg subproject that used it, that way you can depend on it
> "normally" but the build file is pretty messy with shadow jar versions of
> everything etc. I can raise a WIP PR for that approach to compare if anyone
> is interested.
>
> Thanks,
>
> Adrian
>
>
>
> On Fri, 17 Apr 2020 at 15:58, RD <rdsr...@gmail.com> wrote:
>
>> Thanks for the correction, Adrian. I've filed the ticket on GitHub here:
>> https://github.com/apache/incubator-iceberg/issues/934 . There are 2
>> approaches mentioned there with pros/cons. Will be good to get the
>> community's feedback on how to proceed.
>>
>> -best,
>> R.
>>
>> On Fri, Apr 17, 2020 at 6:28 AM Mass Dosage <massdos...@gmail.com> wrote:
>>
>>> Thanks for the detailed notes Ryan. My thoughts on a few of the topics...
>>>
>>> 0.8.0 release - my general preference is to release early and release
>>> often. If features aren't ready why wait? Why not go with a 0.8.0 release
>>> now and then a 0.9.0 (or whatever) a couple of weeks later with the other
>>> features? I know with Apache projects this can sometimes be a challenge
>>> with all the ceremony around a release, getting votes etc. but I don't
>>> think that's such a problem in the incubating stage?
>>>
>>> A clarification on the InputFormats - I think the DDL Ratandeep was
>>> referring to was more like "SHOW PARTITIONS" rather than "ADD PARTITIONS"
>>> i.e. the "read" path but for statements other than "SELECT" etc. Also, to
>>> be clear the `mapreduce` InputFormat that was contributed - it sounds like
>>> that works for Pig but I don't think it will work for Hive 1 or 2 since
>>> they use the `mapred` API for InputFormats. This is what we have attempted
>>> to cover in our InputFormat. I raised a WIP PR for it yesterday at
>>> https://github.com/apache/incubator-iceberg/pull/933 and would
>>> appreciate feedback from anyone interested in it.
>>>
>>> Thanks for sharing the Avro hack for shading and relocating Guava.
>>> Should I create a ticket on GitHub to capture this work? We'll then have a
>>> go at implementing it.
>>>
>>> Thanks,
>>>
>>> Adrian
>>>
>>>
>>> On Fri, 17 Apr 2020 at 04:07, OpenInx <open...@gmail.com> wrote:
>>>
>>>> Thanks for the writing.
>>>> The views from Netflix branch is a great feature, would have any plan
>>>> to port to Apache Iceberg ?
>>>>
>>>> On Fri, Apr 17, 2020 at 5:31 AM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> Here are my notes from yesterday’s sync. As usual, feel free to add to
>>>>> this if I missed something.
>>>>>
>>>>> There were a couple of questions raised during the sync that we’d like
>>>>> to open up to anyone who wasn’t able to attend:
>>>>>
>>>>>    - Should we wait for the parallel metadata rewrite action before
>>>>>    cutting 0.8.0 candidates?
>>>>>    - Should we wait for ORC metrics before cutting 0.8.0 candidates?
>>>>>
>>>>> In the sync, we thought that it would be good to wait and get these
>>>>> in. Please reply to this if you agree or disagree.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> *Attendees*:
>>>>>
>>>>>    - Ryan Blue
>>>>>    - Dan Weeks
>>>>>    - Anjali Norwood
>>>>>    - Jun Ma
>>>>>    - Ratandeep Ratti
>>>>>    - Pavan
>>>>>    - Christine Mathiesen
>>>>>    - Gautam Kowshik
>>>>>    - Mass Dosage
>>>>>    - Filip
>>>>>    - Ryan Murray
>>>>>
>>>>> *Topics*:
>>>>>
>>>>>    - 0.8.0 release blockers: actions, ORC metrics
>>>>>    - Row-level delete update
>>>>>    - Parquet vectorized read update
>>>>>    - InputFormats and Hive support
>>>>>    - Netflix branch
>>>>>
>>>>> *Discussion*:
>>>>>
>>>>>    - 0.8.0 release
>>>>>       - Ryan: we planned to get a candidate out this week, but I
>>>>>       think we may want to wait on 2 things that are about ready
>>>>>       - First: Anton is contributing an action to rewrite manifests
>>>>>       in parallel that is close. Anyone interested? (Gautam is interested)
>>>>>       - Second: ORC is passing correctness tests, but doesn’t have
>>>>>       column-level metrics. Should we wait for this?
>>>>>       - Ratandeep: ORC also lacks predicate push-down support
>>>>>       - Ryan: I think metrics are more important than PPD because PPD
>>>>>       is task side and metrics help reduce the number of tasks. If we
>>>>>       wait on one, I’d prefer to wait on metrics
>>>>>       - Ratandeep will look into whether he or Shardul can work on
>>>>>       this
>>>>>       - General consensus was to wait for these features before
>>>>>       getting a candidate out
>>>>>    - Row-level deletes
>>>>>       - Good progress in several PRs on adding the parallel v2 write
>>>>>       path, as Owen suggested last sync
>>>>>       - Junjie contributed an update to the spec for file/position
>>>>>       delete files
>>>>>    - Parquet vectorized read
>>>>>       - Dan: flat schema reads are primarily waiting on reviews
>>>>>       - Dan: is anyone interested in complex type support?
>>>>>       - Gautam needs struct and map support. 0.14.0 doesn’t support
>>>>>       maps
>>>>>       - Ryan (Murray): 0.17.0 will have lists, structs, and maps, but
>>>>>       not maps of structs
>>>>>       - Ryan (Blue): Because we have a translation layer in Iceberg
>>>>>       to pass off to Spark, we don’t actually need support in Arrow. We
>>>>>       are currently stuck on 0.14.0 because of changes that prevent us
>>>>>       from avoiding a null check (see this comment
>>>>>       <https://github.com/apache/incubator-iceberg/pull/723/files#r367667500>)
>>>>>    - InputFormat and Hive support
>>>>>       - Ratandeep: Generic (mapreduce) InputFormat is in with hooks for
>>>>>       Pig and Hive; need to start working on the serde side and building
>>>>>       a Hive StorageHandler, missing DDL support
>>>>>       - Ryan: What DDL support?
>>>>>       - Ratandeep: Statements like ADD PARTITION
>>>>>       - Ryan: How would all of this work in Hive? It isn’t clear what
>>>>>       components are needed right now: StorageHandler? RawStore?
>>>>>       HiveMetaHook?
>>>>>       - Ratandeep: Currently working on only the read path, which
>>>>>       requires a StorageHandler. The write path would be more difficult.
>>>>>       - Mass Dosage: Working on a (mapred) InputFormat for Hive in
>>>>>       iceberg-mr, started working on a serde in iceberg-hive. Interested
>>>>>       in writes, but not in the short or medium term
>>>>>       - Mass Dosage: The main problem is dependency conflicts between
>>>>>       Hive and Iceberg, mainly Guava
>>>>>       - Ryan: Anyone know a good replacement for Guava collections?
>>>>>       - Ryan: In Avro, we have a module that shades Guava
>>>>>       <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/pom.xml>
>>>>>       and has a class with references
>>>>>       <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/src/main/java/org/apache/avro/GuavaClasses.java>.
>>>>>       Then shading can minimize the shaded classes. We could do that here.
>>>>>       - Ryan: Is Jackson also a problem?
>>>>>       - Mass Dosage: Yes, and Calcite
>>>>>       - Ryan: Calcite probably isn’t referenced directly so we can
>>>>>       hopefully avoid the consistent versions problem by excluding it
>>>>>    - Netflix branch of Iceberg (with non-Iceberg additions)
>>>>>       - Ryan: We’ve published a copy of our current Iceberg
>>>>>       0.7.0-based branch
>>>>>       <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4> for
>>>>>       Spark 2.4 with DSv2 backported
>>>>>       <https://github.com/Netflix/spark>
>>>>>       - The purpose of this is to share non-Iceberg work that we use
>>>>>       to complement Iceberg, like views, catalogs, and DSv2 tables
>>>>>       - Views are SQL views
>>>>>       <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view>
>>>>>       that are stored and versioned like Iceberg metadata. This is how we
>>>>>       are tracking views for Presto and Spark (coral integration would be
>>>>>       nice!). We are contributing the Spark DSv2 ViewCatalog to upstream
>>>>>       Spark
>>>>>       - Metacat is an open metastore project from Netflix. The metacat
>>>>>       package contains our metastore integration
>>>>>       <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/metacat>
>>>>>       for it.
>>>>>       - The batch package contains Spark and Hive table
>>>>>       implementations for Spark’s DSv2
>>>>>       <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/batch>,
>>>>>       which we use for multi-catalog support.
>>>>>    - Gautam: how will migration to Iceberg’s v2 format work for those
>>>>>    of us in production using v1?
>>>>>       - Ryan: Tables are explicitly updated to v2 and both v1 and v2
>>>>>       will be supported in parallel. Using v1 until everything is updated
>>>>>       with v2 support takes care of forward-compatibility issues. This can
>>>>>       be done on a per-table basis
>>>>>       - Gautam: Does migration require rewriting metadata?
>>>>>       - Ryan: No, the format is backward compatible with v1, so the
>>>>>       update is metadata-only until the writers start using new metadata
>>>>>       that v1 would ignore (deletes) and would incorrectly modify if it
>>>>>       were to write to v2.
>>>>>       - Ryan: Also, Iceberg already has a forward-compatibility check
>>>>>       that will prevent v1 readers from loading a v2 table.
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>

-- 
Ryan Blue
Software Engineer
Netflix
