Sorry for the late reply. I agree with the sentiments on 1 and 3 that have already been posted (adopt the Spark encoding, and only have the Variant type). As mentioned on the doc for 3, I think it would be good to specify how to map scalar types to a JSON representation so there can be consistency between engines that don't support variant.
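To make the scalar-to-JSON suggestion concrete, here is a minimal sketch of the kind of fallback mapping the spec could define for engines without variant support. The rendering rules below (base64 for binary, ISO-8601 for temporal types) are illustrative assumptions, not anything the Iceberg or Spark spec has actually defined:

```python
import base64
import datetime
import json

def scalar_to_json(value):
    """Illustrative fallback rendering of variant scalar values as JSON text.

    A sketch of a mapping a spec could standardize for engines that don't
    support variant; the rules here are assumptions, not the actual spec.
    """
    if value is None or isinstance(value, (bool, int, float, str)):
        return json.dumps(value)  # types JSON can represent natively
    if isinstance(value, (bytes, bytearray)):
        # JSON has no binary type; a fallback string encoding must be specified
        return json.dumps(base64.b64encode(bytes(value)).decode("ascii"))
    if isinstance(value, (datetime.date, datetime.datetime)):
        # Render temporal types as ISO-8601 strings
        return json.dumps(value.isoformat())
    raise TypeError(f"no JSON fallback defined for {type(value).__name__}")
```

Whatever the exact rules end up being, the point is that pinning them down in the spec is what lets two engines without variant support render the same value identically.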
> Regarding point 2, I also feel Iceberg is more natural to host such a > subproject for variant spec and implementation. But let me reach out to the > Spark community to discuss. The only other place I can think of that might be a good home for the Variant spec is Apache Arrow, as a canonical extension type. There is an issue for this [1]. I think the main question about where this is housed is which types are intended to be supported. I believe Arrow is currently a superset of the Iceberg type system (UUID is supported as a canonical extension type [2]). For point 4, subcolumnarization, I think this ideally belongs in Iceberg (and if Iceberg and Delta Lake can agree on how to do it, that would be great), with consultation with the Parquet/ORC communities to potentially add better native support. Thanks, Micah [1] https://github.com/apache/arrow/issues/42069 [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html On Sat, Jul 20, 2024 at 5:54 PM Aihua Xu <aihu...@gmail.com> wrote: > Thanks for the discussion and feedback. > > Do we have consensus on points 1 and 3 to move forward with the Spark > variant encoding and support the Variant type only? Or let me know how to > proceed from here. > > Regarding point 2, I also feel Iceberg is more natural to host such a > subproject for variant spec and implementation. But let me reach out to the > Spark community to discuss. > > Thanks, > Aihua > > > On Fri, Jul 19, 2024 at 9:35 AM Yufei Gu <flyrain...@gmail.com> wrote: > >> Agreed with point 1. >> >> For point 2, I also prefer to hold the spec and reference implementation >> under Iceberg. Here are the reasons: >> 1. It is unconventional and impractical for one engine to depend on >> another for data types. For instance, it is not ideal for Trino to rely on >> data types defined by the Spark engine. >> 2. Iceberg serves as a bridge between engines and file formats. 
By >> centralizing the specification in Iceberg, any future optimizations or >> updates to file formats can be referred to within Iceberg, ensuring >> consistency and reducing dependencies. >> >> For point 3, I'd prefer to support the variant type only at this moment. >> >> Yufei >> >> >> On Thu, Jul 18, 2024 at 12:55 PM Ryan Blue <b...@databricks.com.invalid> >> wrote: >> >>> Similarly, I'm aligned with point 1 and I'd choose to support only >>> variant for point 3. >>> >>> We'll need to work with the Spark community to find a good place for the >>> library and spec, since it touches many different projects. I'd also prefer >>> Iceberg as the home. >>> >>> I also think it's a good idea to get subcolumnarization into our spec >>> when we update. Without that I think the feature will be fairly limited. >>> >>> On Thu, Jul 18, 2024 at 10:56 AM Russell Spitzer < >>> russell.spit...@gmail.com> wrote: >>> >>>> I'm aligned with point 1. >>>> >>>> For point 2 I think we should choose quickly. I honestly think this >>>> would be fine as part of the Iceberg Spec directly, but understand it may be >>>> better for the broader community if it was a sub-project. As a >>>> sub-project I would still prefer it being an Iceberg subproject since we >>>> are engine/file-format agnostic. >>>> >>>> 3. I support adding just Variant. >>>> >>>> On Thu, Jul 18, 2024 at 12:54 AM Aihua Xu <aihu...@apache.org> wrote: >>>> >>>>> Hello community, >>>>> >>>>> It’s great to sync up with some of you on Variant and >>>>> Subcolumnarization support in Iceberg again. Apologies that I didn’t record >>>>> the meeting, but here are some key items that we want to follow up on with the >>>>> community. >>>>> >>>>> 1. Adopt Spark Variant encoding >>>>> Those present were in favor of adopting the Spark variant encoding >>>>> for Iceberg Variant with extensions to support other Iceberg types. We >>>>> would like to know if anyone has an objection to reusing an open >>>>> source encoding. >>>>> >>>>> 2. 
Movement of the Spark Variant Spec to another project >>>>> To avoid introducing Apache Spark as a dependency for the engines and >>>>> file formats, we discussed separating the Spark Variant encoding spec and >>>>> implementation from the Spark Project to a neutral location. We thought up >>>>> several solutions but didn’t have consensus on any of them. We are looking >>>>> for more feedback on this topic from the community, either in terms of >>>>> support for one of these options or another idea on how to support the >>>>> spec. >>>>> >>>>> Options Proposed: >>>>> * Leave the Spec in Spark (Difficult for versioning and other engines) >>>>> * Copy the Spec into the Iceberg Project Directly (Difficult for other >>>>> Table Formats) >>>>> * Create a Sub-Project of Apache Iceberg and move the spec and >>>>> reference implementation there (Logistically complicated) >>>>> * Create a Sub-Project of Apache Spark and move the spec and >>>>> reference implementation there (Logistically complicated) >>>>> >>>>> 3. Add Variant type vs. Variant and JSON types >>>>> Those who were present were in favor of adding only the Variant type >>>>> to Iceberg. We are looking for anyone who has an objection to going >>>>> forward >>>>> with just the Variant Type and no Iceberg JSON Type. We were favoring >>>>> adding the Variant type only because: >>>>> * Introducing a JSON type would require engines that only support >>>>> VARIANT to do write-time validation of their input to a JSON column. An >>>>> engine without a JSON type wouldn’t be able to support this. >>>>> * Engines which don’t support Variant will work most of the time and >>>>> can have fallback strings defined in the spec for reading unsupported >>>>> types. Writing a JSON into a Variant will always work. >>>>> >>>>> 4. Support for Subcolumnarization spec (shredding in Spark) >>>>> We have no action items on this but would like to follow up on >>>>> discussions on Subcolumnarization in the future. 
>>>>> * We had general agreement that this should be included in Iceberg V3 >>>>> or else adding variant may not be useful. >>>>> * We are interested in also adopting the shredding spec from Spark and >>>>> would like to move it to wherever we decide the Variant spec is >>>>> going to live. >>>>> >>>>> Let us know if we missed anything and if you have any additional thoughts >>>>> or suggestions. >>>>> >>>>> Thanks >>>>> Aihua >>>>> >>>>> >>>>> On 2024/07/15 18:32:22 Aihua Xu wrote: >>>>> > Thanks for the discussion. >>>>> > >>>>> > I will move forward to work on the spec PR. >>>>> > >>>>> > Regarding the implementation, we will have a module for Variant >>>>> support in Iceberg so we will not have to bring in Spark libraries. >>>>> > >>>>> > I'm reposting the meeting invite in case it's not clear in my >>>>> original email since I included it at the end. Looks like we don't have major >>>>> objections/divergences but let's sync up and reach consensus. >>>>> > >>>>> > Meeting invite: >>>>> > >>>>> > Wednesday, July 17 · 9:00 – 10:00am >>>>> > Time zone: America/Los_Angeles >>>>> > Google Meet joining info >>>>> > Video call link: https://meet.google.com/pbm-ovzn-aoq >>>>> > Or dial: (US) +1 650-449-9343 PIN: 170 576 525# >>>>> > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790 >>>>> > >>>>> > Thanks, >>>>> > Aihua >>>>> > >>>>> > On 2024/07/12 20:55:01 Micah Kornfield wrote: >>>>> > > I don't think this needs to hold up the PR but I think coming to a >>>>> > > consensus on the exact set of types supported is worthwhile (and >>>>> if the >>>>> > > goal is to maintain the same set as specified by the Spark Variant >>>>> type or >>>>> > > if divergence is expected/allowed). From a fragmentation >>>>> perspective it >>>>> > > would be a shame if they diverge, so maybe a next step is also >>>>> suggesting >>>>> > > that the Spark community add support for the missing existing Iceberg >>>>> types? 
>>>>> > > >>>>> > > Thanks, >>>>> > > Micah >>>>> > > >>>>> > > On Fri, Jul 12, 2024 at 1:44 PM Russell Spitzer < >>>>> russell.spit...@gmail.com> >>>>> > > wrote: >>>>> > > >>>>> > > > Just talked with Aihua and he's working on the Spec PR now. We >>>>> can get >>>>> > > > feedback there from everyone. >>>>> > > > >>>>> > > > On Fri, Jul 12, 2024 at 3:41 PM Ryan Blue >>>>> <b...@databricks.com.invalid> >>>>> > > > wrote: >>>>> > > > >>>>> > > >> Good idea, but I'm hoping that we can continue to get their >>>>> feedback in >>>>> > > >> parallel to getting the spec changes started. Piotr didn't seem >>>>> to object >>>>> > > >> to the encoding from what I read of his comments. Hopefully he >>>>> (and others) >>>>> > > >> chime in here. >>>>> > > >> >>>>> > > >> On Fri, Jul 12, 2024 at 1:32 PM Russell Spitzer < >>>>> > > >> russell.spit...@gmail.com> wrote: >>>>> > > >> >>>>> > > >>> I just want to make sure we get Piotr and Peter on board as >>>>> > > >>> representatives of Flink and Trino engines. Also make sure we >>>>> have anyone >>>>> > > >>> else chime in who has experience with Ray if possible. >>>>> > > >>> >>>>> > > >>> Spec changes feel like the right next step. >>>>> > > >>> >>>>> > > >>> On Fri, Jul 12, 2024 at 3:14 PM Ryan Blue >>>>> <b...@databricks.com.invalid> >>>>> > > >>> wrote: >>>>> > > >>> >>>>> > > >>>> Okay, what are the next steps here? This proposal has been >>>>> out for >>>>> > > >>>> quite a while and I don't see any major objections to using >>>>> the Spark >>>>> > > >>>> encoding. It's quite well designed and fits the need well. It >>>>> can also be >>>>> > > >>>> extended to support additional types that are missing if >>>>> that's a priority. >>>>> > > >>>> >>>>> > > >>>> Should we move forward by starting a draft of the changes to >>>>> the table >>>>> > > >>>> spec? Then we can vote on committing those changes and get >>>>> moving on an >>>>> > > >>>> implementation (or possibly do the implementation in >>>>> parallel). 
>>>>> > > >>>> >>>>> > > >>>> On Fri, Jul 12, 2024 at 1:08 PM Russell Spitzer < >>>>> > > >>>> russell.spit...@gmail.com> wrote: >>>>> > > >>>> >>>>> > > >>>>> That's fair, I'm sold on an Iceberg Module. >>>>> > > >>>>> >>>>> > > >>>>> On Fri, Jul 12, 2024 at 2:53 PM Ryan Blue >>>>> <b...@databricks.com.invalid> >>>>> > > >>>>> wrote: >>>>> > > >>>>> >>>>> > > >>>>>> > Feels like eventually the encoding should land in parquet >>>>> proper >>>>> > > >>>>>> right? >>>>> > > >>>>>> >>>>> > > >>>>>> What about using it in ORC? I don't know where it should >>>>> end up. >>>>> > > >>>>>> Maybe Iceberg should make a standalone module from it? >>>>> > > >>>>>> >>>>> > > >>>>>> On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer < >>>>> > > >>>>>> russell.spit...@gmail.com> wrote: >>>>> > > >>>>>> >>>>> > > >>>>>>> Feels like eventually the encoding should land in parquet >>>>> proper >>>>> > > >>>>>>> right? I'm fine with us just copying into Iceberg though >>>>> for the time >>>>> > > >>>>>>> being. >>>>> > > >>>>>>> >>>>> > > >>>>>>> On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue >>>>> > > >>>>>>> <b...@databricks.com.invalid> wrote: >>>>> > > >>>>>>> >>>>> > > >>>>>>>> Oops, it looks like I missed where Aihua brought this up >>>>> in his >>>>> > > >>>>>>>> last email: >>>>> > > >>>>>>>> >>>>> > > >>>>>>>> > do we have an issue to directly use Spark >>>>> implementation in >>>>> > > >>>>>>>> Iceberg? >>>>> > > >>>>>>>> >>>>> > > >>>>>>>> Yes, I think that we do have an issue using the Spark >>>>> library. What >>>>> > > >>>>>>>> do you think about a Java implementation in Iceberg? >>>>> > > >>>>>>>> >>>>> > > >>>>>>>> Ryan >>>>> > > >>>>>>>> >>>>> > > >>>>>>>> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue < >>>>> b...@databricks.com> >>>>> > > >>>>>>>> wrote: >>>>> > > >>>>>>>> >>>>> > > >>>>>>>>> I raised the same point from Peter's email in a comment >>>>> on the doc >>>>> > > >>>>>>>>> as well. 
There is a spark-variant_2.13 artifact that >>>>> would be a much >>>>> > > >>>>>>>>> smaller scope than relying on large portions of Spark, >>>>> but even then I >>>>> > > >>>>>>>>> doubt that it is a good idea for Iceberg to depend on >>>>> that because it is a >>>>> > > >>>>>>>>> Scala artifact and we would need to bring in a ton of >>>>> Scala libs. I think >>>>> > > >>>>>>>>> what makes the most sense is to have an independent >>>>> implementation of the >>>>> > > >>>>>>>>> spec in Iceberg. >>>>> > > >>>>>>>>> >>>>> > > >>>>>>>>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry < >>>>> > > >>>>>>>>> peter.vary.apa...@gmail.com> wrote: >>>>> > > >>>>>>>>> >>>>> > > >>>>>>>>>> Hi Aihua, >>>>> > > >>>>>>>>>> Long time no see :) >>>>> > > >>>>>>>>>> Would this mean that every engine which plans to >>>>> support the Variant >>>>> > > >>>>>>>>>> data type needs to add Spark as a dependency? Like >>>>> Flink/Trino/Hive etc? >>>>> > > >>>>>>>>>> Thanks, Peter >>>>> > > >>>>>>>>>> >>>>> > > >>>>>>>>>> >>>>> > > >>>>>>>>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu < >>>>> aihu...@apache.org> wrote: >>>>> > > >>>>>>>>>> >>>>> > > >>>>>>>>>>> Thanks Ryan. >>>>> > > >>>>>>>>>>> >>>>> > > >>>>>>>>>>> Yeah. That's another reason we want to pursue the Spark >>>>> encoding: to >>>>> > > >>>>>>>>>>> keep compatibility across the open source engines. >>>>> > > >>>>>>>>>>> >>>>> > > >>>>>>>>>>> One more question regarding the encoding >>>>> implementation: do we >>>>> > > >>>>>>>>>>> have an issue to directly use the Spark implementation in >>>>> Iceberg? Russell >>>>> > > >>>>>>>>>>> pointed out that Trino doesn't have a Spark dependency >>>>> and that could be a >>>>> > > >>>>>>>>>>> problem? >>>>> > > >>>>>>>>>>> >>>>> > > >>>>>>>>>>> Thanks, >>>>> > > >>>>>>>>>>> Aihua >>>>> > > >>>>>>>>>>> >>>>> > > >>>>>>>>>>> On 2024/07/12 15:02:06 Ryan Blue wrote: >>>>> > > >>>>>>>>>>> > Thanks, Aihua! 
>>>>> > > >>>>>>>>>>> > >>>>> > > >>>>>>>>>>> > I think that the encoding choice in the current doc >>>>> is a good >>>>> > > >>>>>>>>>>> one. I went >>>>> > > >>>>>>>>>>> > through the Spark encoding in detail and it looks >>>>> like a >>>>> > > >>>>>>>>>>> better choice than >>>>> > > >>>>>>>>>>> > the other candidate encodings for quickly accessing >>>>> nested >>>>> > > >>>>>>>>>>> fields. >>>>> > > >>>>>>>>>>> > >>>>> > > >>>>>>>>>>> > Another reason to use the Spark type is that this is >>>>> what >>>>> > > >>>>>>>>>>> Delta's variant >>>>> > > >>>>>>>>>>> > type is based on, so Parquet files in tables written >>>>> by Delta >>>>> > > >>>>>>>>>>> could be >>>>> > > >>>>>>>>>>> > converted or used in Iceberg tables without needing >>>>> to rewrite >>>>> > > >>>>>>>>>>> variant >>>>> > > >>>>>>>>>>> > data. (Also, note that I work at Databricks and have >>>>> an >>>>> > > >>>>>>>>>>> interest in >>>>> > > >>>>>>>>>>> > increasing format compatibility.) >>>>> > > >>>>>>>>>>> > >>>>> > > >>>>>>>>>>> > Ryan >>>>> > > >>>>>>>>>>> > >>>>> > > >>>>>>>>>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu < >>>>> > > >>>>>>>>>>> aihua...@snowflake.com.invalid> >>>>> > > >>>>>>>>>>> > wrote: >>>>> > > >>>>>>>>>>> > >>>>> > > >>>>>>>>>>> > > [Discuss] Consensus for Variant Encoding >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > It’s great to be able to present the Variant type >>>>> proposal >>>>> > > >>>>>>>>>>> in the >>>>> > > >>>>>>>>>>> > > community sync yesterday and I’m looking to host a >>>>> meeting >>>>> > > >>>>>>>>>>> next week >>>>> > > >>>>>>>>>>> > > (targeting for 9am, July 17th) to go over any >>>>> further >>>>> > > >>>>>>>>>>> concerns about the >>>>> > > >>>>>>>>>>> > > encoding of the Variant type and any other >>>>> questions on the >>>>> > > >>>>>>>>>>> first phase of >>>>> > > >>>>>>>>>>> > > the proposal >>>>> > > >>>>>>>>>>> > > < >>>>> > > >>>>>>>>>>> >>>>> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit 
>>>>> > > >>>>>>>>>>> >. >>>>> > > >>>>>>>>>>> > > We are hoping that anyone who is interested in the >>>>> proposal >>>>> > > >>>>>>>>>>> can either join >>>>> > > >>>>>>>>>>> > > or reply with their comments so we can discuss >>>>> them. A summary >>>>> > > >>>>>>>>>>> of the >>>>> > > >>>>>>>>>>> > > discussion and notes will be sent to the mailing >>>>> list for >>>>> > > >>>>>>>>>>> further comment >>>>> > > >>>>>>>>>>> > > there. >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > - >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > What should be the underlying binary >>>>> representation >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > We have evaluated a few encodings in the doc >>>>> including ION, >>>>> > > >>>>>>>>>>> JSONB, and >>>>> > > >>>>>>>>>>> > > the Spark encoding. Choosing the underlying encoding is >>>>> an >>>>> > > >>>>>>>>>>> important first step >>>>> > > >>>>>>>>>>> > > here and we believe we have general support for >>>>> Spark’s >>>>> > > >>>>>>>>>>> Variant encoding. >>>>> > > >>>>>>>>>>> > > We would like to hear if anyone else has strong >>>>> opinions in >>>>> > > >>>>>>>>>>> this space. >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > - >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > Should we support multiple logical types or >>>>> just Variant? >>>>> > > >>>>>>>>>>> Variant vs. >>>>> > > >>>>>>>>>>> > > Variant + JSON. >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > This is to discuss what logical data type(s) should be >>>>> supported >>>>> > > >>>>>>>>>>> in Iceberg - >>>>> > > >>>>>>>>>>> > > Variant only vs. Variant + JSON. Both types would >>>>> share the >>>>> > > >>>>>>>>>>> same underlying >>>>> > > >>>>>>>>>>> > > encoding but would imply different limitations on >>>>> engines >>>>> > > >>>>>>>>>>> working with >>>>> > > >>>>>>>>>>> > > those types. 
>>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > From the sync-up meeting, we are leaning >>>>> toward >>>>> > > >>>>>>>>>>> supporting Variant >>>>> > > >>>>>>>>>>> > > only, and we want to reach consensus on the >>>>> supported >>>>> > > >>>>>>>>>>> type(s). >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > - >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > How should we move forward with >>>>> Subcolumnarization? >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > Subcolumnarization is an optimization for the Variant >>>>> type that >>>>> > > >>>>>>>>>>> separates out >>>>> > > >>>>>>>>>>> > > subcolumns with their own metadata. This is not >>>>> critical for >>>>> > > >>>>>>>>>>> choosing the >>>>> > > >>>>>>>>>>> > > initial encoding of the Variant type so we were >>>>> hoping to >>>>> > > >>>>>>>>>>> gain consensus on >>>>> > > >>>>>>>>>>> > > leaving that for a follow-up spec. >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > Thanks >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > Aihua >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > Meeting invite: >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > Wednesday, July 17 · 9:00 – 10:00am >>>>> > > >>>>>>>>>>> > > Time zone: America/Los_Angeles >>>>> > > >>>>>>>>>>> > > Google Meet joining info >>>>> > > >>>>>>>>>>> > > Video call link: >>>>> https://meet.google.com/pbm-ovzn-aoq >>>>> > > >>>>>>>>>>> > > Or dial: (US) +1 650-449-9343 PIN: 170 576 525# >>>>> > > >>>>>>>>>>> > > More phone numbers: >>>>> > > >>>>>>>>>>> https://tel.meet/pbm-ovzn-aoq?pin=4079632691790 >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu < >>>>> > > >>>>>>>>>>> aihua...@snowflake.com> wrote: >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > >> Hello, >>>>> > > >>>>>>>>>>> > >> >>>>> > > >>>>>>>>>>> > >> We have drafted the proposal >>>>> > > >>>>>>>>>>> > >> < >>>>> > > >>>>>>>>>>> >>>>> 
https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit >>>>> > > >>>>>>>>>>> > >>>>> > > >>>>>>>>>>> > >> for the Variant data type. Please help review and >>>>> comment. >>>>> > > >>>>>>>>>>> > >> >>>>> > > >>>>>>>>>>> > >> Thanks, >>>>> > > >>>>>>>>>>> > >> Aihua >>>>> > > >>>>>>>>>>> > >> >>>>> > > >>>>>>>>>>> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye < >>>>> > > >>>>>>>>>>> yezhao...@gmail.com> wrote: >>>>> > > >>>>>>>>>>> > >> >>>>> > > >>>>>>>>>>> > >>> +10000 for a JSON/BSON type. We also had the same >>>>> > > >>>>>>>>>>> discussion internally >>>>> > > >>>>>>>>>>> > >>> and a JSON type would really play well with, for >>>>> example, >>>>> > > >>>>>>>>>>> the SUPER type in >>>>> > > >>>>>>>>>>> > >>> Redshift: >>>>> > > >>>>>>>>>>> > >>> >>>>> > > >>>>>>>>>>> >>>>> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, >>>>> > > >>>>>>>>>>> and >>>>> > > >>>>>>>>>>> > >>> can also provide better integration with the >>>>> Trino JSON >>>>> > > >>>>>>>>>>> type. >>>>> > > >>>>>>>>>>> > >>> >>>>> > > >>>>>>>>>>> > >>> Looking forward to the proposal! 
>>>>> > > >>>>>>>>>>> > >>> >>>>> > > >>>>>>>>>>> > >>> Best, >>>>> > > >>>>>>>>>>> > >>> Jack Ye >>>>> > > >>>>>>>>>>> > >>> >>>>> > > >>>>>>>>>>> > >>> >>>>> > > >>>>>>>>>>> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau >>>>> > > >>>>>>>>>>> > >>> <tyler.aki...@snowflake.com.invalid> wrote: >>>>> > > >>>>>>>>>>> > >>> >>>>> > > >>>>>>>>>>> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu < >>>>> ust...@gmail.com> >>>>> > > >>>>>>>>>>> wrote: >>>>> > > >>>>>>>>>>> > >>>> >>>>> > > >>>>>>>>>>> > >>>>> > We may need some guidance on just how many >>>>> we need to >>>>> > > >>>>>>>>>>> look at; >>>>> > > >>>>>>>>>>> > >>>>> > we were planning on Spark and Trino, but >>>>> weren't sure >>>>> > > >>>>>>>>>>> how much >>>>> > > >>>>>>>>>>> > >>>>> > further down the rabbit hole we needed to go. >>>>> > > >>>>>>>>>>> > >>>>> >>>>> > > >>>>>>>>>>> > >>>>> There are some engines living outside the Java >>>>> world. It >>>>> > > >>>>>>>>>>> would be >>>>> > > >>>>>>>>>>> > >>>>> good if the proposal could cover the effort it >>>>> takes to >>>>> > > >>>>>>>>>>> integrate >>>>> > > >>>>>>>>>>> > >>>>> the variant type into them (e.g. velox, datafusion, >>>>> etc.). >>>>> > > >>>>>>>>>>> This is something >>>>> > > >>>>>>>>>>> > >>>>> that >>>>> > > >>>>>>>>>>> > >>>>> some proprietary Iceberg vendors also care >>>>> about. >>>>> > > >>>>>>>>>>> > >>>>> >>>>> > > >>>>>>>>>>> > >>>> >>>>> > > >>>>>>>>>>> > >>>> Ack, makes sense. We can make sure to share some >>>>> > > >>>>>>>>>>> perspective on this. >>>>> > > >>>>>>>>>>> > >>>> >>>>> > > >>>>>>>>>>> > >>>> > Not necessarily, no. As long as there's a >>>>> binary type >>>>> > > >>>>>>>>>>> and Iceberg and >>>>> > > >>>>>>>>>>> > >>>>> > the query engines are aware that the binary >>>>> column >>>>> > > >>>>>>>>>>> needs to be >>>>> > > >>>>>>>>>>> > >>>>> > interpreted as a variant, that should be >>>>> sufficient. 
>>>>> > > >>>>>>>>>>> > >>>>> >>>>> > > >>>>>>>>>>> > >>>>> From the perspective of interoperability, it >>>>> would be >>>>> > > >>>>>>>>>>> good to support >>>>> > > >>>>>>>>>>> > >>>>> a native >>>>> > > >>>>>>>>>>> > >>>>> type in the file specs. Life will be easier for >>>>> projects >>>>> > > >>>>>>>>>>> like Apache >>>>> > > >>>>>>>>>>> > >>>>> XTable. >>>>> > > >>>>>>>>>>> > >>>>> File formats could also provide finer-grained >>>>> statistics >>>>> > > >>>>>>>>>>> for the variant >>>>> > > >>>>>>>>>>> > >>>>> type, which >>>>> > > >>>>>>>>>>> > >>>>> facilitates data skipping. >>>>> > > >>>>>>>>>>> > >>>>> >>>>> > > >>>>>>>>>>> > >>>> >>>>> > > >>>>>>>>>>> > >>>> Agreed, there can definitely be additional >>>>> value in >>>>> > > >>>>>>>>>>> native file format >>>>> > > >>>>>>>>>>> > >>>> integration. Just wanted to highlight that it's >>>>> not a >>>>> > > >>>>>>>>>>> strict requirement. >>>>> > > >>>>>>>>>>> > >>>> >>>>> > > >>>>>>>>>>> > >>>> -Tyler >>>>> > > >>>>>>>>>>> > >>>> >>>>> > > >>>>>>>>>>> > >>>> >>>>> > > >>>>>>>>>>> > >>>>> >>>>> > > >>>>>>>>>>> > >>>>> Gang >>>>> > > >>>>>>>>>>> > >>>>> >>>>> > > >>>>>>>>>>> > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau >>>>> > > >>>>>>>>>>> > >>>>> <tyler.aki...@snowflake.com.invalid> wrote: >>>>> > > >>>>>>>>>>> > >>>>> >>>>> > > >>>>>>>>>>> > >>>>>> Good to see you again as well, JB! Thanks! 
>>>>> > > >>>>>>>>>>> > >>>>>> >>>>> > > >>>>>>>>>>> > >>>>>> -Tyler >>>>> > > >>>>>>>>>>> > >>>>>> >>>>> > > >>>>>>>>>>> > >>>>>> >>>>> > > >>>>>>>>>>> > >>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste >>>>> Onofré < >>>>> > > >>>>>>>>>>> j...@nanthrax.net> >>>>> > > >>>>>>>>>>> > >>>>>> wrote: >>>>> > > >>>>>>>>>>> > >>>>>> >>>>> > > >>>>>>>>>>> > >>>>>>> Hi Tyler, >>>>> > > >>>>>>>>>>> > >>>>>>> >>>>> > > >>>>>>>>>>> > >>>>>>> Super happy to see you there :) It reminds >>>>> me of our >>>>> > > >>>>>>>>>>> discussions back at >>>>> > > >>>>>>>>>>> > >>>>>>> the start of Apache Beam :) >>>>> > > >>>>>>>>>>> > >>>>>>> >>>>> > > >>>>>>>>>>> > >>>>>>> Anyway, the thread is pretty interesting. I >>>>> remember >>>>> > > >>>>>>>>>>> some discussions >>>>> > > >>>>>>>>>>> > >>>>>>> about a JSON datatype for spec v3. The binary >>>>> data type >>>>> > > >>>>>>>>>>> is already >>>>> > > >>>>>>>>>>> > >>>>>>> supported in the spec v2. >>>>> > > >>>>>>>>>>> > >>>>>>> >>>>> > > >>>>>>>>>>> > >>>>>>> I'm looking forward to the proposal and >>>>> happy to help >>>>> > > >>>>>>>>>>> on this! >>>>> > > >>>>>>>>>>> > >>>>>>> >>>>> > > >>>>>>>>>>> > >>>>>>> Regards >>>>> > > >>>>>>>>>>> > >>>>>>> JB >>>>> > > >>>>>>>>>>> > >>>>>>> >>>>> > > >>>>>>>>>>> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau >>>>> > > >>>>>>>>>>> > >>>>>>> <tyler.aki...@snowflake.com.invalid> wrote: >>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > Hello, >>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are >>>>> working on a >>>>> > > >>>>>>>>>>> proposal for >>>>> > > >>>>>>>>>>> > >>>>>>> which we’d like to get early feedback from >>>>> the >>>>> > > >>>>>>>>>>> community. As you may know, >>>>> > > >>>>>>>>>>> > >>>>>>> Snowflake has embraced Iceberg as its open >>>>> Data Lake >>>>> > > >>>>>>>>>>> format. 
Having made >>>>> > > >>>>>>>>>>> > >>>>>>> good progress on our own adoption of the >>>>> Iceberg >>>>> > > >>>>>>>>>>> standard, we’re now in a >>>>> > > >>>>>>>>>>> > >>>>>>> position where there are features not yet >>>>> supported in >>>>> > > >>>>>>>>>>> Iceberg which we >>>>> > > >>>>>>>>>>> > >>>>>>> think would be valuable for our users, and >>>>> that we >>>>> > > >>>>>>>>>>> would like to discuss >>>>> > > >>>>>>>>>>> > >>>>>>> with and help contribute to the Iceberg >>>>> community. >>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > The first two such features we’d like to >>>>> discuss are >>>>> > > >>>>>>>>>>> in support of >>>>> > > >>>>>>>>>>> > >>>>>>> efficient querying of dynamically typed, >>>>> > > >>>>>>>>>>> semi-structured data: variant data >>>>> > > >>>>>>>>>>> > >>>>>>> types, and subcolumnarization of variant >>>>> columns. In >>>>> > > >>>>>>>>>>> more detail, for >>>>> > > >>>>>>>>>>> > >>>>>>> anyone who may not already be familiar: >>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > 1. Variant data types >>>>> > > >>>>>>>>>>> > >>>>>>> > Variant types allow for the efficient >>>>> binary >>>>> > > >>>>>>>>>>> encoding of dynamic >>>>> > > >>>>>>>>>>> > >>>>>>> semi-structured data such as JSON, Avro, >>>>> etc. By >>>>> > > >>>>>>>>>>> encoding semi-structured >>>>> > > >>>>>>>>>>> > >>>>>>> data as a variant column, we retain the >>>>> flexibility of >>>>> > > >>>>>>>>>>> the source data, >>>>> > > >>>>>>>>>>> > >>>>>>> while allowing query engines to more >>>>> efficiently >>>>> > > >>>>>>>>>>> operate on the data. >>>>> > > >>>>>>>>>>> > >>>>>>> Snowflake has supported the variant data >>>>> type on >>>>> > > >>>>>>>>>>> Snowflake tables for many >>>>> > > >>>>>>>>>>> > >>>>>>> years [1]. 
As more and more users utilize >>>>> Iceberg >>>>> > > >>>>>>>>>>> tables in Snowflake, >>>>> > > >>>>>>>>>>> > >>>>>>> we’re hearing an increasing chorus of >>>>> requests for >>>>> > > >>>>>>>>>>> variant support. >>>>> > > >>>>>>>>>>> > >>>>>>> Additionally, other query engines such as >>>>> Apache Spark >>>>> > > >>>>>>>>>>> have begun adding >>>>> > > >>>>>>>>>>> > >>>>>>> variant support [2]. As such, we believe it >>>>> would be >>>>> > > >>>>>>>>>>> beneficial to the >>>>> > > >>>>>>>>>>> > >>>>>>> Iceberg community as a whole to standardize >>>>> on the >>>>> > > >>>>>>>>>>> variant data type >>>>> > > >>>>>>>>>>> > >>>>>>> encoding used across Iceberg tables. >>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > One specific point to make here is that, >>>>> since an >>>>> > > >>>>>>>>>>> Apache OSS >>>>> > > >>>>>>>>>>> > >>>>>>> version of variant encoding already exists >>>>> in Spark, >>>>> > > >>>>>>>>>>> it likely makes sense >>>>> > > >>>>>>>>>>> > >>>>>>> to simply adopt the Spark encoding as the >>>>> Iceberg >>>>> > > >>>>>>>>>>> standard as well. The >>>>> > > >>>>>>>>>>> > >>>>>>> encoding we use internally today in >>>>> Snowflake is >>>>> > > >>>>>>>>>>> slightly different, but >>>>> > > >>>>>>>>>>> > >>>>>>> essentially equivalent, and we see no >>>>> particular value >>>>> > > >>>>>>>>>>> in trying to clutter >>>>> > > >>>>>>>>>>> > >>>>>>> the space with another >>>>> equivalent-but-incompatible >>>>> > > >>>>>>>>>>> encoding. >>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > 2. 
Subcolumnarization >>>>> > > >>>>>>>>>>> > >>>>>>> > Subcolumnarization of variant columns >>>>> allows query >>>>> > > >>>>>>>>>>> engines to >>>>> > > >>>>>>>>>>> > >>>>>>> efficiently prune datasets when subcolumns >>>>> (i.e., >>>>> > > >>>>>>>>>>> nested fields) within a >>>>> > > >>>>>>>>>>> > >>>>>>> variant column are queried, and also allows >>>>> optionally >>>>> > > >>>>>>>>>>> materializing some >>>>> > > >>>>>>>>>>> > >>>>>>> of the nested fields as columns of their >>>>> own, >>>>> > > >>>>>>>>>>> affording queries on these >>>>> > > >>>>>>>>>>> > >>>>>>> subcolumns the ability to read less data and >>>>> spend >>>>> > > >>>>>>>>>>> less CPU on extraction. >>>>> > > >>>>>>>>>>> > >>>>>>> When subcolumnarizing, the system managing >>>>> table >>>>> > > >>>>>>>>>>> metadata and data tracks >>>>> > > >>>>>>>>>>> > >>>>>>> individual pruning statistics (min, max, >>>>> null, etc.) >>>>> > > >>>>>>>>>>> for some subset of the >>>>> > > >>>>>>>>>>> > >>>>>>> nested fields within a variant, and also >>>>> manages any >>>>> > > >>>>>>>>>>> optional >>>>> > > >>>>>>>>>>> > >>>>>>> materialization. Without subcolumnarization, >>>>> any query >>>>> > > >>>>>>>>>>> which touches a >>>>> > > >>>>>>>>>>> > >>>>>>> variant column must read, parse, extract, >>>>> and filter >>>>> > > >>>>>>>>>>> every row for which >>>>> > > >>>>>>>>>>> > >>>>>>> that column is non-null. Thus, by providing a >>>>> > > >>>>>>>>>>> standardized way of tracking >>>>> > > >>>>>>>>>>> > >>>>>>> subcolumn metadata and data for variant >>>>> columns, >>>>> > > >>>>>>>>>>> Iceberg can make >>>>> > > >>>>>>>>>>> > >>>>>>> subcolumnar optimizations accessible across >>>>> various >>>>> > > >>>>>>>>>>> catalogs and query >>>>> > > >>>>>>>>>>> > >>>>>>> engines. 
>>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > Subcolumnarization is a non-trivial topic, >>>>> so we >>>>> > > >>>>>>>>>>> expect any >>>>> > > >>>>>>>>>>> > >>>>>>> concrete proposal to include not only the >>>>> set of >>>>> > > >>>>>>>>>>> changes to Iceberg >>>>> > > >>>>>>>>>>> > >>>>>>> metadata that allow compatible query engines >>>>> to >>>>> > > >>>>>>>>>>> interoperate on >>>>> > > >>>>>>>>>>> > >>>>>>> subcolumnarization data for variant columns, >>>>> but also >>>>> > > >>>>>>>>>>> reference >>>>> > > >>>>>>>>>>> > >>>>>>> documentation explaining subcolumnarization >>>>> principles >>>>> > > >>>>>>>>>>> and recommended best >>>>> > > >>>>>>>>>>> > >>>>>>> practices. >>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > It sounds like the recent Geo proposal [3] >>>>> may be a >>>>> > > >>>>>>>>>>> good starting >>>>> > > >>>>>>>>>>> > >>>>>>> point for how to approach this, so our plan >>>>> is to >>>>> > > >>>>>>>>>>> write something up in >>>>> > > >>>>>>>>>>> > >>>>>>> that vein that covers the proposed spec >>>>> changes, >>>>> > > >>>>>>>>>>> backwards compatibility, >>>>> > > >>>>>>>>>>> > >>>>>>> implementor burdens, etc. But we wanted to >>>>> first reach >>>>> > > >>>>>>>>>>> out to the community >>>>> > > >>>>>>>>>>> > >>>>>>> to introduce ourselves and the idea, and see >>>>> if >>>>> > > >>>>>>>>>>> there’s any early feedback >>>>> > > >>>>>>>>>>> > >>>>>>> we should incorporate before we spend too >>>>> much time on >>>>> > > >>>>>>>>>>> a concrete proposal. >>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > Thank you! 
> [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
> [2] https://github.com/apache/spark/blob/master/common/variant/README.md
> [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>
> -Tyler, Nileema, Selcuk, Aihua

--
Ryan Blue
Databricks
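To make the subcolumnarization discussion in this thread concrete, here is a minimal sketch of the pruning idea Tyler describes: track per-file min/max statistics for selected nested fields of a variant column, so a scan can skip files without reading or parsing any variant data. All names and structures here are invented for illustration; the actual shredding spec is still to be adopted.

```python
# Illustrative sketch of subcolumnarization-style pruning (all names invented;
# this is not the Iceberg or Spark shredding spec). Each data file tracks
# min/max stats for a subset of nested variant fields, e.g. "event.ts";
# a scan can then skip files whose stats exclude the predicate range.

class DataFile:
    def __init__(self, path, subcolumn_stats):
        self.path = path
        # subcolumn_stats: {"dotted.field.path": (min_value, max_value), ...}
        self.subcolumn_stats = subcolumn_stats

def prune(files, field, lower, upper):
    """Keep only files whose [min, max] range for `field` overlaps [lower, upper].

    Files with no stats for the field cannot be pruned and must be scanned."""
    kept = []
    for f in files:
        stats = f.subcolumn_stats.get(field)
        if stats is None:
            kept.append(f)  # unknown range: must read the variant data
            continue
        lo, hi = stats
        if hi >= lower and lo <= upper:
            kept.append(f)  # ranges overlap: file may contain matches
    return kept

files = [
    DataFile("a.parquet", {"event.ts": (100, 199)}),
    DataFile("b.parquet", {"event.ts": (200, 299)}),
    DataFile("c.parquet", {}),  # not subcolumnarized: always scanned
]
# Predicate: event.ts BETWEEN 150 AND 180
survivors = [f.path for f in prune(files, "event.ts", 150, 180)]
```

Without such stats, every file's variant column has to be read, parsed, extracted, and filtered row by row, which is why the group felt shredding belongs in the V3 timeframe alongside the type itself.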