Sorry for the late reply. I agree with the sentiments on 1 and 3 that have already been posted (adopt the Spark encoding, and only have the Variant type). As mentioned on the doc for 3, I think it would be good to specify how to map scalar types to a JSON representation so there can be consistency between engines that don't support variant.
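To make the scalar-to-JSON suggestion concrete, here is a minimal sketch of the kind of fallback mapping the spec could define for engines without variant support. The rendering rules below (base64 for binary, ISO-8601 for temporal types) are illustrative assumptions, not anything the Iceberg or Spark spec has actually defined:

```python
import base64
import datetime
import json

def scalar_to_json(value):
    """Illustrative fallback rendering of variant scalar values as JSON text.

    A sketch of a mapping a spec could standardize for engines that don't
    support variant; the rules here are assumptions, not the actual spec.
    """
    if value is None or isinstance(value, (bool, int, float, str)):
        return json.dumps(value)  # types JSON can represent natively
    if isinstance(value, (bytes, bytearray)):
        # JSON has no binary type; a fallback string encoding must be specified
        return json.dumps(base64.b64encode(bytes(value)).decode("ascii"))
    if isinstance(value, (datetime.date, datetime.datetime)):
        # Render temporal types as ISO-8601 strings
        return json.dumps(value.isoformat())
    raise TypeError(f"no JSON fallback defined for {type(value).__name__}")
```

Whatever the exact rules end up being, the point is that pinning them down in the spec is what lets two engines without variant support render the same value identically.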
> Regarding point 2, I also feel Iceberg is more natural to host such a > subproject for variant spec and implementation. But let me reach out to the > Spark community to discuss. The only other place I can think of that might be a good home for the Variant spec is Apache Arrow, as a canonical extension type. There is an issue for this [1]. I think the main question about where this is housed is which types are intended to be supported. I believe Arrow is currently a superset of the Iceberg type system (UUID is supported as a canonical extension type [2]). For point 4, subcolumnarization, I think this ideally belongs in Iceberg (and if Iceberg and Delta Lake can agree on how to do it, that would be great), with consultation with the Parquet/ORC communities to potentially add better native support. Thanks, Micah [1] https://github.com/apache/arrow/issues/42069 [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html On Sat, Jul 20, 2024 at 5:54 PM Aihua Xu <aihu...@gmail.com> wrote: > Thanks for the discussion and feedback. > > Do we have consensus on points 1 and 3 to move forward with the Spark > variant encoding and support the Variant type only? Or let me know how to > proceed from here. > > Regarding point 2, I also feel Iceberg is more natural to host such a > subproject for variant spec and implementation. But let me reach out to the > Spark community to discuss. > > Thanks, > Aihua > > > On Fri, Jul 19, 2024 at 9:35 AM Yufei Gu <flyrain...@gmail.com> wrote: > >> Agreed with point 1. >> >> For point 2, I also prefer to hold the spec and reference implementation >> under Iceberg. Here are the reasons: >> 1. It is unconventional and impractical for one engine to depend on >> another for data types. For instance, it is not ideal for Trino to rely on >> data types defined by the Spark engine. >> 2. Iceberg serves as a bridge between engines and file formats. 
By >> centralizing the specification in Iceberg, any future optimizations or >> updates to file formats can be referred to within Iceberg, ensuring >> consistency and reducing dependencies. >> >> For point 3, I'd prefer to support the variant type only at this moment. >> >> Yufei >> >> >> On Thu, Jul 18, 2024 at 12:55 PM Ryan Blue <b...@databricks.com.invalid> >> wrote: >> >>> Similarly, I'm aligned with point 1 and I'd choose to support only >>> variant for point 3. >>> >>> We'll need to work with the Spark community to find a good place for the >>> library and spec, since it touches many different projects. I'd also prefer >>> Iceberg as the home. >>> >>> I also think it's a good idea to get subcolumnarization into our spec >>> when we update. Without that I think the feature will be fairly limited. >>> >>> On Thu, Jul 18, 2024 at 10:56 AM Russell Spitzer < >>> russell.spit...@gmail.com> wrote: >>> >>>> I'm aligned with point 1. >>>> >>>> For point 2 I think we should choose quickly. I honestly think this >>>> would be fine as part of the Iceberg Spec directly, but understand it may be >>>> better for the broader community if it was a sub-project. As a >>>> sub-project I would still prefer it being an Iceberg subproject since we >>>> are engine/file-format agnostic. >>>> >>>> 3. I support adding just Variant. >>>> >>>> On Thu, Jul 18, 2024 at 12:54 AM Aihua Xu <aihu...@apache.org> wrote: >>>> >>>>> Hello community, >>>>> >>>>> It’s great to sync up with some of you on Variant and >>>>> Subcolumnarization support in Iceberg again. Apologies that I didn’t record >>>>> the meeting, but here are some key items that we want to follow up on with the >>>>> community. >>>>> >>>>> 1. Adopt Spark Variant encoding >>>>> Those present were in favor of adopting the Spark variant encoding >>>>> for Iceberg Variant with extensions to support other Iceberg types. We >>>>> would like to know if anyone has an objection to reusing an open >>>>> source encoding. >>>>> >>>>> 2. 
Movement of the Spark Variant Spec to another project >>>>> To avoid introducing Apache Spark as a dependency for the engines and >>>>> file formats, we discussed separating the Spark Variant encoding spec and >>>>> implementation from the Spark Project to a neutral location. We thought up >>>>> several solutions but didn’t have consensus on any of them. We are looking >>>>> for more feedback on this topic from the community, either in terms of >>>>> support for one of these options or another idea on how to support the >>>>> spec. >>>>> >>>>> Options Proposed: >>>>> * Leave the Spec in Spark (Difficult for versioning and other engines) >>>>> * Copy the Spec into the Iceberg Project Directly (Difficult for other >>>>> Table Formats) >>>>> * Create a Sub-Project of Apache Iceberg and move the spec and >>>>> reference implementation there (Logistically complicated) >>>>> * Create a Sub-Project of Apache Spark and move the spec and >>>>> reference implementation there (Logistically complicated) >>>>> >>>>> 3. Add Variant type vs. Variant and JSON types >>>>> Those who were present were in favor of adding only the Variant type >>>>> to Iceberg. We are looking for anyone who has an objection to going >>>>> forward >>>>> with just the Variant Type and no Iceberg JSON Type. We were favoring >>>>> adding the Variant type only because: >>>>> * Introducing a JSON type would require engines that only support >>>>> VARIANT to do write-time validation of their input to a JSON column. An >>>>> engine without a JSON type wouldn’t be able to support this. >>>>> * Engines which don’t support Variant will work most of the time and >>>>> can have fallback strings defined in the spec for reading unsupported >>>>> types. Writing a JSON into a Variant will always work. >>>>> >>>>> 4. Support for Subcolumnarization spec (shredding in Spark) >>>>> We have no action items on this but would like to follow up on >>>>> discussions on Subcolumnarization in the future. 
>>>>> * We had general agreement that this should be included in Iceberg V3 >>>>> or else adding variant may not be useful. >>>>> * We are interested in also adopting the shredding spec from Spark and >>>>> would like to move it to wherever we decide the Variant spec is >>>>> going to live. >>>>> >>>>> Let us know if we missed anything and if you have any additional thoughts >>>>> or suggestions. >>>>> >>>>> Thanks >>>>> Aihua >>>>> >>>>> >>>>> On 2024/07/15 18:32:22 Aihua Xu wrote: >>>>> > Thanks for the discussion. >>>>> > >>>>> > I will move forward to work on the spec PR. >>>>> > >>>>> > Regarding the implementation, we will have a module for Variant >>>>> support in Iceberg so we will not have to bring in Spark libraries. >>>>> > >>>>> > I'm reposting the meeting invite in case it's not clear in my >>>>> original email since I included it at the end. Looks like we don't have major >>>>> objections/divergences but let's sync up and reach consensus. >>>>> > >>>>> > Meeting invite: >>>>> > >>>>> > Wednesday, July 17 · 9:00 – 10:00am >>>>> > Time zone: America/Los_Angeles >>>>> > Google Meet joining info >>>>> > Video call link: https://meet.google.com/pbm-ovzn-aoq >>>>> > Or dial: (US) +1 650-449-9343 PIN: 170 576 525# >>>>> > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790 >>>>> > >>>>> > Thanks, >>>>> > Aihua >>>>> > >>>>> > On 2024/07/12 20:55:01 Micah Kornfield wrote: >>>>> > > I don't think this needs to hold up the PR but I think coming to a >>>>> > > consensus on the exact set of types supported is worthwhile (and >>>>> if the >>>>> > > goal is to maintain the same set as specified by the Spark Variant >>>>> type or >>>>> > > if divergence is expected/allowed). From a fragmentation >>>>> perspective it >>>>> > > would be a shame if they diverge, so maybe a next step is also >>>>> suggesting >>>>> > > that the Spark community add support for the missing existing Iceberg >>>>> types? 
>>>>> > > >>>>> > > Thanks, >>>>> > > Micah >>>>> > > >>>>> > > On Fri, Jul 12, 2024 at 1:44 PM Russell Spitzer < >>>>> russell.spit...@gmail.com> >>>>> > > wrote: >>>>> > > >>>>> > > > Just talked with Aihua and he's working on the Spec PR now. We >>>>> can get >>>>> > > > feedback there from everyone. >>>>> > > > >>>>> > > > On Fri, Jul 12, 2024 at 3:41 PM Ryan Blue >>>>> <b...@databricks.com.invalid> >>>>> > > > wrote: >>>>> > > > >>>>> > > >> Good idea, but I'm hoping that we can continue to get their >>>>> feedback in >>>>> > > >> parallel to getting the spec changes started. Piotr didn't seem >>>>> to object >>>>> > > >> to the encoding from what I read of his comments. Hopefully he >>>>> (and others) >>>>> > > >> chime in here. >>>>> > > >> >>>>> > > >> On Fri, Jul 12, 2024 at 1:32 PM Russell Spitzer < >>>>> > > >> russell.spit...@gmail.com> wrote: >>>>> > > >> >>>>> > > >>> I just want to make sure we get Piotr and Peter on board as >>>>> > > >>> representatives of Flink and Trino engines. Also make sure we >>>>> have anyone >>>>> > > >>> else chime in who has experience with Ray if possible. >>>>> > > >>> >>>>> > > >>> Spec changes feel like the right next step. >>>>> > > >>> >>>>> > > >>> On Fri, Jul 12, 2024 at 3:14 PM Ryan Blue >>>>> <b...@databricks.com.invalid> >>>>> > > >>> wrote: >>>>> > > >>> >>>>> > > >>>> Okay, what are the next steps here? This proposal has been >>>>> out for >>>>> > > >>>> quite a while and I don't see any major objections to using >>>>> the Spark >>>>> > > >>>> encoding. It's quite well designed and fits the need well. It >>>>> can also be >>>>> > > >>>> extended to support additional types that are missing if >>>>> that's a priority. >>>>> > > >>>> >>>>> > > >>>> Should we move forward by starting a draft of the changes to >>>>> the table >>>>> > > >>>> spec? Then we can vote on committing those changes and get >>>>> moving on an >>>>> > > >>>> implementation (or possibly do the implementation in >>>>> parallel). 
>>>>> > > >>>> >>>>> > > >>>> On Fri, Jul 12, 2024 at 1:08 PM Russell Spitzer < >>>>> > > >>>> russell.spit...@gmail.com> wrote: >>>>> > > >>>> >>>>> > > >>>>> That's fair, I'm sold on an Iceberg Module. >>>>> > > >>>>> >>>>> > > >>>>> On Fri, Jul 12, 2024 at 2:53 PM Ryan Blue >>>>> <b...@databricks.com.invalid> >>>>> > > >>>>> wrote: >>>>> > > >>>>> >>>>> > > >>>>>> > Feels like eventually the encoding should land in parquet >>>>> proper >>>>> > > >>>>>> right? >>>>> > > >>>>>> >>>>> > > >>>>>> What about using it in ORC? I don't know where it should >>>>> end up. >>>>> > > >>>>>> Maybe Iceberg should make a standalone module from it? >>>>> > > >>>>>> >>>>> > > >>>>>> On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer < >>>>> > > >>>>>> russell.spit...@gmail.com> wrote: >>>>> > > >>>>>> >>>>> > > >>>>>>> Feels like eventually the encoding should land in parquet >>>>> proper >>>>> > > >>>>>>> right? I'm fine with us just copying into Iceberg though >>>>> for the time >>>>> > > >>>>>>> being. >>>>> > > >>>>>>> >>>>> > > >>>>>>> On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue >>>>> > > >>>>>>> <b...@databricks.com.invalid> wrote: >>>>> > > >>>>>>> >>>>> > > >>>>>>>> Oops, it looks like I missed where Aihua brought this up >>>>> in his >>>>> > > >>>>>>>> last email: >>>>> > > >>>>>>>> >>>>> > > >>>>>>>> > do we have an issue to directly use Spark >>>>> implementation in >>>>> > > >>>>>>>> Iceberg? >>>>> > > >>>>>>>> >>>>> > > >>>>>>>> Yes, I think that we do have an issue using the Spark >>>>> library. What >>>>> > > >>>>>>>> do you think about a Java implementation in Iceberg? >>>>> > > >>>>>>>> >>>>> > > >>>>>>>> Ryan >>>>> > > >>>>>>>> >>>>> > > >>>>>>>> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue < >>>>> b...@databricks.com> >>>>> > > >>>>>>>> wrote: >>>>> > > >>>>>>>> >>>>> > > >>>>>>>>> I raised the same point from Peter's email in a comment >>>>> on the doc >>>>> > > >>>>>>>>> as well. 
There is a spark-variant_2.13 artifact that >>>>> would be a much >>>>> > > >>>>>>>>> smaller scope than relying on large portions of Spark, >>>>> but even then I >>>>> > > >>>>>>>>> doubt that it is a good idea for Iceberg to depend on >>>>> that because it is a >>>>> > > >>>>>>>>> Scala artifact and we would need to bring in a ton of >>>>> Scala libs. I think >>>>> > > >>>>>>>>> what makes the most sense is to have an independent >>>>> implementation of the >>>>> > > >>>>>>>>> spec in Iceberg. >>>>> > > >>>>>>>>> >>>>> > > >>>>>>>>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry < >>>>> > > >>>>>>>>> peter.vary.apa...@gmail.com> wrote: >>>>> > > >>>>>>>>> >>>>> > > >>>>>>>>>> Hi Aihua, >>>>> > > >>>>>>>>>> Long time no see :) >>>>> > > >>>>>>>>>> Would this mean that every engine which plans to >>>>> support the Variant >>>>> > > >>>>>>>>>> data type needs to add Spark as a dependency? Like >>>>> Flink/Trino/Hive etc? >>>>> > > >>>>>>>>>> Thanks, Peter >>>>> > > >>>>>>>>>> >>>>> > > >>>>>>>>>> >>>>> > > >>>>>>>>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu < >>>>> aihu...@apache.org> wrote: >>>>> > > >>>>>>>>>> >>>>> > > >>>>>>>>>>> Thanks Ryan. >>>>> > > >>>>>>>>>>> >>>>> > > >>>>>>>>>>> Yeah. That's another reason we want to pursue the Spark >>>>> encoding: to >>>>> > > >>>>>>>>>>> keep compatibility across the open source engines. >>>>> > > >>>>>>>>>>> >>>>> > > >>>>>>>>>>> One more question regarding the encoding >>>>> implementation: do we >>>>> > > >>>>>>>>>>> have an issue to directly use the Spark implementation in >>>>> Iceberg? Russell >>>>> > > >>>>>>>>>>> pointed out that Trino doesn't have a Spark dependency >>>>> and that could be a >>>>> > > >>>>>>>>>>> problem? >>>>> > > >>>>>>>>>>> >>>>> > > >>>>>>>>>>> Thanks, >>>>> > > >>>>>>>>>>> Aihua >>>>> > > >>>>>>>>>>> >>>>> > > >>>>>>>>>>> On 2024/07/12 15:02:06 Ryan Blue wrote: >>>>> > > >>>>>>>>>>> > Thanks, Aihua! 
>>>>> > > >>>>>>>>>>> > >>>>> > > >>>>>>>>>>> > I think that the encoding choice in the current doc >>>>> is a good >>>>> > > >>>>>>>>>>> one. I went >>>>> > > >>>>>>>>>>> > through the Spark encoding in detail and it looks >>>>> like a >>>>> > > >>>>>>>>>>> better choice than >>>>> > > >>>>>>>>>>> > the other candidate encodings for quickly accessing >>>>> nested >>>>> > > >>>>>>>>>>> fields. >>>>> > > >>>>>>>>>>> > >>>>> > > >>>>>>>>>>> > Another reason to use the Spark type is that this is >>>>> what >>>>> > > >>>>>>>>>>> Delta's variant >>>>> > > >>>>>>>>>>> > type is based on, so Parquet files in tables written >>>>> by Delta >>>>> > > >>>>>>>>>>> could be >>>>> > > >>>>>>>>>>> > converted or used in Iceberg tables without needing >>>>> to rewrite >>>>> > > >>>>>>>>>>> variant >>>>> > > >>>>>>>>>>> > data. (Also, note that I work at Databricks and have >>>>> an >>>>> > > >>>>>>>>>>> interest in >>>>> > > >>>>>>>>>>> > increasing format compatibility.) >>>>> > > >>>>>>>>>>> > >>>>> > > >>>>>>>>>>> > Ryan >>>>> > > >>>>>>>>>>> > >>>>> > > >>>>>>>>>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu < >>>>> > > >>>>>>>>>>> aihua...@snowflake.com.invalid> >>>>> > > >>>>>>>>>>> > wrote: >>>>> > > >>>>>>>>>>> > >>>>> > > >>>>>>>>>>> > > [Discuss] Consensus for Variant Encoding >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > It’s great to be able to present the Variant type >>>>> proposal >>>>> > > >>>>>>>>>>> in the >>>>> > > >>>>>>>>>>> > > community sync yesterday and I’m looking to host a >>>>> meeting >>>>> > > >>>>>>>>>>> next week >>>>> > > >>>>>>>>>>> > > (targeting for 9am, July 17th) to go over any >>>>> further >>>>> > > >>>>>>>>>>> concerns about the >>>>> > > >>>>>>>>>>> > > encoding of the Variant type and any other >>>>> questions on the >>>>> > > >>>>>>>>>>> first phase of >>>>> > > >>>>>>>>>>> > > the proposal >>>>> > > >>>>>>>>>>> > > < >>>>> > > >>>>>>>>>>> >>>>> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit 
>>>>> > > >>>>>>>>>>> >. >>>>> > > >>>>>>>>>>> > > We are hoping that anyone who is interested in the >>>>> proposal >>>>> > > >>>>>>>>>>> can either join >>>>> > > >>>>>>>>>>> > > or reply with their comments so we can discuss >>>>> them. A summary >>>>> > > >>>>>>>>>>> of the >>>>> > > >>>>>>>>>>> > > discussion and notes will be sent to the mailing >>>>> list for >>>>> > > >>>>>>>>>>> further comment >>>>> > > >>>>>>>>>>> > > there. >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > - >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > What should be the underlying binary >>>>> representation >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > We have evaluated a few encodings in the doc >>>>> including ION, >>>>> > > >>>>>>>>>>> JSONB, and >>>>> > > >>>>>>>>>>> > > the Spark encoding. Choosing the underlying encoding is >>>>> an >>>>> > > >>>>>>>>>>> important first step >>>>> > > >>>>>>>>>>> > > here and we believe we have general support for >>>>> Spark’s >>>>> > > >>>>>>>>>>> Variant encoding. >>>>> > > >>>>>>>>>>> > > We would like to hear if anyone else has strong >>>>> opinions in >>>>> > > >>>>>>>>>>> this space. >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > - >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > Should we support multiple logical types or >>>>> just Variant? >>>>> > > >>>>>>>>>>> Variant vs. >>>>> > > >>>>>>>>>>> > > Variant + JSON. >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > This is to discuss what logical data type(s) should be >>>>> supported >>>>> > > >>>>>>>>>>> in Iceberg - >>>>> > > >>>>>>>>>>> > > Variant only vs. Variant + JSON. Both types would >>>>> share the >>>>> > > >>>>>>>>>>> same underlying >>>>> > > >>>>>>>>>>> > > encoding but would imply different limitations on >>>>> engines >>>>> > > >>>>>>>>>>> working with >>>>> > > >>>>>>>>>>> > > those types. 
>>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > From the sync-up meeting, we are leaning >>>>> toward >>>>> > > >>>>>>>>>>> supporting Variant >>>>> > > >>>>>>>>>>> > > only, and we want to reach consensus on the >>>>> supported >>>>> > > >>>>>>>>>>> type(s). >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > - >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > How should we move forward with >>>>> Subcolumnarization? >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > Subcolumnarization is an optimization for the Variant >>>>> type that >>>>> > > >>>>>>>>>>> separates out >>>>> > > >>>>>>>>>>> > > subcolumns with their own metadata. This is not >>>>> critical for >>>>> > > >>>>>>>>>>> choosing the >>>>> > > >>>>>>>>>>> > > initial encoding of the Variant type so we were >>>>> hoping to >>>>> > > >>>>>>>>>>> gain consensus on >>>>> > > >>>>>>>>>>> > > leaving that for a follow-up spec. >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > Thanks >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > Aihua >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > Meeting invite: >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > Wednesday, July 17 · 9:00 – 10:00am >>>>> > > >>>>>>>>>>> > > Time zone: America/Los_Angeles >>>>> > > >>>>>>>>>>> > > Google Meet joining info >>>>> > > >>>>>>>>>>> > > Video call link: >>>>> https://meet.google.com/pbm-ovzn-aoq >>>>> > > >>>>>>>>>>> > > Or dial: (US) +1 650-449-9343 PIN: 170 576 525# >>>>> > > >>>>>>>>>>> > > More phone numbers: >>>>> > > >>>>>>>>>>> https://tel.meet/pbm-ovzn-aoq?pin=4079632691790 >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu < >>>>> > > >>>>>>>>>>> aihua...@snowflake.com> wrote: >>>>> > > >>>>>>>>>>> > > >>>>> > > >>>>>>>>>>> > >> Hello, >>>>> > > >>>>>>>>>>> > >> >>>>> > > >>>>>>>>>>> > >> We have drafted the proposal >>>>> > > >>>>>>>>>>> > >> < >>>>> > > >>>>>>>>>>> >>>>> 
https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit >>>>> > > >>>>>>>>>>> > >>>>> > > >>>>>>>>>>> > >> for the Variant data type. Please help review and >>>>> comment. >>>>> > > >>>>>>>>>>> > >> >>>>> > > >>>>>>>>>>> > >> Thanks, >>>>> > > >>>>>>>>>>> > >> Aihua >>>>> > > >>>>>>>>>>> > >> >>>>> > > >>>>>>>>>>> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye < >>>>> > > >>>>>>>>>>> yezhao...@gmail.com> wrote: >>>>> > > >>>>>>>>>>> > >> >>>>> > > >>>>>>>>>>> > >>> +10000 for a JSON/BSON type. We also had the same >>>>> > > >>>>>>>>>>> discussion internally >>>>> > > >>>>>>>>>>> > >>> and a JSON type would really play well with, for >>>>> example, >>>>> > > >>>>>>>>>>> the SUPER type in >>>>> > > >>>>>>>>>>> > >>> Redshift: >>>>> > > >>>>>>>>>>> > >>> >>>>> > > >>>>>>>>>>> >>>>> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, >>>>> > > >>>>>>>>>>> and >>>>> > > >>>>>>>>>>> > >>> can also provide better integration with the >>>>> Trino JSON >>>>> > > >>>>>>>>>>> type. >>>>> > > >>>>>>>>>>> > >>> >>>>> > > >>>>>>>>>>> > >>> Looking forward to the proposal! 
>>>>> > > >>>>>>>>>>> > >>> >>>>> > > >>>>>>>>>>> > >>> Best, >>>>> > > >>>>>>>>>>> > >>> Jack Ye >>>>> > > >>>>>>>>>>> > >>> >>>>> > > >>>>>>>>>>> > >>> >>>>> > > >>>>>>>>>>> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau >>>>> > > >>>>>>>>>>> > >>> <tyler.aki...@snowflake.com.invalid> wrote: >>>>> > > >>>>>>>>>>> > >>> >>>>> > > >>>>>>>>>>> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu < >>>>> ust...@gmail.com> >>>>> > > >>>>>>>>>>> wrote: >>>>> > > >>>>>>>>>>> > >>>> >>>>> > > >>>>>>>>>>> > >>>>> > We may need some guidance on just how many >>>>> we need to >>>>> > > >>>>>>>>>>> look at; >>>>> > > >>>>>>>>>>> > >>>>> > we were planning on Spark and Trino, but >>>>> weren't sure >>>>> > > >>>>>>>>>>> how much >>>>> > > >>>>>>>>>>> > >>>>> > further down the rabbit hole we needed to go. >>>>> > > >>>>>>>>>>> > >>>>> >>>>> > > >>>>>>>>>>> > >>>>> There are some engines living outside the Java >>>>> world. It >>>>> > > >>>>>>>>>>> would be >>>>> > > >>>>>>>>>>> > >>>>> good if the proposal could cover the effort it >>>>> takes to >>>>> > > >>>>>>>>>>> integrate >>>>> > > >>>>>>>>>>> > >>>>> the variant type into them (e.g. velox, datafusion, >>>>> etc.). >>>>> > > >>>>>>>>>>> This is something >>>>> > > >>>>>>>>>>> > >>>>> that >>>>> > > >>>>>>>>>>> > >>>>> some proprietary Iceberg vendors also care >>>>> about. >>>>> > > >>>>>>>>>>> > >>>>> >>>>> > > >>>>>>>>>>> > >>>> >>>>> > > >>>>>>>>>>> > >>>> Ack, makes sense. We can make sure to share some >>>>> > > >>>>>>>>>>> perspective on this. >>>>> > > >>>>>>>>>>> > >>>> >>>>> > > >>>>>>>>>>> > >>>> > Not necessarily, no. As long as there's a >>>>> binary type >>>>> > > >>>>>>>>>>> and Iceberg and >>>>> > > >>>>>>>>>>> > >>>>> > the query engines are aware that the binary >>>>> column >>>>> > > >>>>>>>>>>> needs to be >>>>> > > >>>>>>>>>>> > >>>>> > interpreted as a variant, that should be >>>>> sufficient. 
>>>>> > > >>>>>>>>>>> > >>>>> >>>>> > > >>>>>>>>>>> > >>>>> From the perspective of interoperability, it >>>>> would be >>>>> > > >>>>>>>>>>> good to support >>>>> > > >>>>>>>>>>> > >>>>> a native >>>>> > > >>>>>>>>>>> > >>>>> type in the file specs. Life will be easier for >>>>> projects >>>>> > > >>>>>>>>>>> like Apache >>>>> > > >>>>>>>>>>> > >>>>> XTable. >>>>> > > >>>>>>>>>>> > >>>>> File formats could also provide finer-grained >>>>> statistics >>>>> > > >>>>>>>>>>> for the variant >>>>> > > >>>>>>>>>>> > >>>>> type, which >>>>> > > >>>>>>>>>>> > >>>>> facilitates data skipping. >>>>> > > >>>>>>>>>>> > >>>>> >>>>> > > >>>>>>>>>>> > >>>> >>>>> > > >>>>>>>>>>> > >>>> Agreed, there can definitely be additional >>>>> value in >>>>> > > >>>>>>>>>>> native file format >>>>> > > >>>>>>>>>>> > >>>> integration. Just wanted to highlight that it's >>>>> not a >>>>> > > >>>>>>>>>>> strict requirement. >>>>> > > >>>>>>>>>>> > >>>> >>>>> > > >>>>>>>>>>> > >>>> -Tyler >>>>> > > >>>>>>>>>>> > >>>> >>>>> > > >>>>>>>>>>> > >>>> >>>>> > > >>>>>>>>>>> > >>>>> >>>>> > > >>>>>>>>>>> > >>>>> Gang >>>>> > > >>>>>>>>>>> > >>>>> >>>>> > > >>>>>>>>>>> > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau >>>>> > > >>>>>>>>>>> > >>>>> <tyler.aki...@snowflake.com.invalid> wrote: >>>>> > > >>>>>>>>>>> > >>>>> >>>>> > > >>>>>>>>>>> > >>>>>> Good to see you again as well, JB! Thanks! 
>>>>> > > >>>>>>>>>>> > >>>>>> >>>>> > > >>>>>>>>>>> > >>>>>> -Tyler >>>>> > > >>>>>>>>>>> > >>>>>> >>>>> > > >>>>>>>>>>> > >>>>>> >>>>> > > >>>>>>>>>>> > >>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste >>>>> Onofré < >>>>> > > >>>>>>>>>>> j...@nanthrax.net> >>>>> > > >>>>>>>>>>> > >>>>>> wrote: >>>>> > > >>>>>>>>>>> > >>>>>> >>>>> > > >>>>>>>>>>> > >>>>>>> Hi Tyler, >>>>> > > >>>>>>>>>>> > >>>>>>> >>>>> > > >>>>>>>>>>> > >>>>>>> Super happy to see you there :) It reminds >>>>> me of our >>>>> > > >>>>>>>>>>> discussions back at >>>>> > > >>>>>>>>>>> > >>>>>>> the start of Apache Beam :) >>>>> > > >>>>>>>>>>> > >>>>>>> >>>>> > > >>>>>>>>>>> > >>>>>>> Anyway, the thread is pretty interesting. I >>>>> remember >>>>> > > >>>>>>>>>>> some discussions >>>>> > > >>>>>>>>>>> > >>>>>>> about a JSON datatype for spec v3. The binary >>>>> data type >>>>> > > >>>>>>>>>>> is already >>>>> > > >>>>>>>>>>> > >>>>>>> supported in the spec v2. >>>>> > > >>>>>>>>>>> > >>>>>>> >>>>> > > >>>>>>>>>>> > >>>>>>> I'm looking forward to the proposal and >>>>> happy to help >>>>> > > >>>>>>>>>>> on this! >>>>> > > >>>>>>>>>>> > >>>>>>> >>>>> > > >>>>>>>>>>> > >>>>>>> Regards >>>>> > > >>>>>>>>>>> > >>>>>>> JB >>>>> > > >>>>>>>>>>> > >>>>>>> >>>>> > > >>>>>>>>>>> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau >>>>> > > >>>>>>>>>>> > >>>>>>> <tyler.aki...@snowflake.com.invalid> wrote: >>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > Hello, >>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are >>>>> working on a >>>>> > > >>>>>>>>>>> proposal for >>>>> > > >>>>>>>>>>> > >>>>>>> which we’d like to get early feedback from >>>>> the >>>>> > > >>>>>>>>>>> community. As you may know, >>>>> > > >>>>>>>>>>> > >>>>>>> Snowflake has embraced Iceberg as its open >>>>> Data Lake >>>>> > > >>>>>>>>>>> format. 
Having made >>>>> > > >>>>>>>>>>> > >>>>>>> good progress on our own adoption of the >>>>> Iceberg >>>>> > > >>>>>>>>>>> standard, we’re now in a >>>>> > > >>>>>>>>>>> > >>>>>>> position where there are features not yet >>>>> supported in >>>>> > > >>>>>>>>>>> Iceberg which we >>>>> > > >>>>>>>>>>> > >>>>>>> think would be valuable for our users, and >>>>> that we >>>>> > > >>>>>>>>>>> would like to discuss >>>>> > > >>>>>>>>>>> > >>>>>>> with and help contribute to the Iceberg >>>>> community. >>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > The first two such features we’d like to >>>>> discuss are >>>>> > > >>>>>>>>>>> in support of >>>>> > > >>>>>>>>>>> > >>>>>>> efficient querying of dynamically typed, >>>>> > > >>>>>>>>>>> semi-structured data: variant data >>>>> > > >>>>>>>>>>> > >>>>>>> types, and subcolumnarization of variant >>>>> columns. In >>>>> > > >>>>>>>>>>> more detail, for >>>>> > > >>>>>>>>>>> > >>>>>>> anyone who may not already be familiar: >>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > 1. Variant data types >>>>> > > >>>>>>>>>>> > >>>>>>> > Variant types allow for the efficient >>>>> binary >>>>> > > >>>>>>>>>>> encoding of dynamic >>>>> > > >>>>>>>>>>> > >>>>>>> semi-structured data such as JSON, Avro, >>>>> etc. By >>>>> > > >>>>>>>>>>> encoding semi-structured >>>>> > > >>>>>>>>>>> > >>>>>>> data as a variant column, we retain the >>>>> flexibility of >>>>> > > >>>>>>>>>>> the source data, >>>>> > > >>>>>>>>>>> > >>>>>>> while allowing query engines to more >>>>> efficiently >>>>> > > >>>>>>>>>>> operate on the data. >>>>> > > >>>>>>>>>>> > >>>>>>> Snowflake has supported the variant data >>>>> type on >>>>> > > >>>>>>>>>>> Snowflake tables for many >>>>> > > >>>>>>>>>>> > >>>>>>> years [1]. 
As more and more users utilize >>>>> Iceberg >>>>> > > >>>>>>>>>>> tables in Snowflake, >>>>> > > >>>>>>>>>>> > >>>>>>> we’re hearing an increasing chorus of >>>>> requests for >>>>> > > >>>>>>>>>>> variant support. >>>>> > > >>>>>>>>>>> > >>>>>>> Additionally, other query engines such as >>>>> Apache Spark >>>>> > > >>>>>>>>>>> have begun adding >>>>> > > >>>>>>>>>>> > >>>>>>> variant support [2]. As such, we believe it >>>>> would be >>>>> > > >>>>>>>>>>> beneficial to the >>>>> > > >>>>>>>>>>> > >>>>>>> Iceberg community as a whole to standardize >>>>> on the >>>>> > > >>>>>>>>>>> variant data type >>>>> > > >>>>>>>>>>> > >>>>>>> encoding used across Iceberg tables. >>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > One specific point to make here is that, >>>>> since an >>>>> > > >>>>>>>>>>> Apache OSS >>>>> > > >>>>>>>>>>> > >>>>>>> version of variant encoding already exists >>>>> in Spark, >>>>> > > >>>>>>>>>>> it likely makes sense >>>>> > > >>>>>>>>>>> > >>>>>>> to simply adopt the Spark encoding as the >>>>> Iceberg >>>>> > > >>>>>>>>>>> standard as well. The >>>>> > > >>>>>>>>>>> > >>>>>>> encoding we use internally today in >>>>> Snowflake is >>>>> > > >>>>>>>>>>> slightly different, but >>>>> > > >>>>>>>>>>> > >>>>>>> essentially equivalent, and we see no >>>>> particular value >>>>> > > >>>>>>>>>>> in trying to clutter >>>>> > > >>>>>>>>>>> > >>>>>>> the space with another >>>>> equivalent-but-incompatible >>>>> > > >>>>>>>>>>> encoding. >>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > 2. 
Subcolumnarization >>>>> > > >>>>>>>>>>> > >>>>>>> > Subcolumnarization of variant columns >>>>> allows query >>>>> > > >>>>>>>>>>> engines to >>>>> > > >>>>>>>>>>> > >>>>>>> efficiently prune datasets when subcolumns >>>>> (i.e., >>>>> > > >>>>>>>>>>> nested fields) within a >>>>> > > >>>>>>>>>>> > >>>>>>> variant column are queried, and also allows >>>>> optionally >>>>> > > >>>>>>>>>>> materializing some >>>>> > > >>>>>>>>>>> > >>>>>>> of the nested fields as columns of their >>>>> own, >>>>> > > >>>>>>>>>>> affording queries on these >>>>> > > >>>>>>>>>>> > >>>>>>> subcolumns the ability to read less data and >>>>> spend >>>>> > > >>>>>>>>>>> less CPU on extraction. >>>>> > > >>>>>>>>>>> > >>>>>>> When subcolumnarizing, the system managing >>>>> table >>>>> > > >>>>>>>>>>> metadata and data tracks >>>>> > > >>>>>>>>>>> > >>>>>>> individual pruning statistics (min, max, >>>>> null, etc.) >>>>> > > >>>>>>>>>>> for some subset of the >>>>> > > >>>>>>>>>>> > >>>>>>> nested fields within a variant, and also >>>>> manages any >>>>> > > >>>>>>>>>>> optional >>>>> > > >>>>>>>>>>> > >>>>>>> materialization. Without subcolumnarization, >>>>> any query >>>>> > > >>>>>>>>>>> which touches a >>>>> > > >>>>>>>>>>> > >>>>>>> variant column must read, parse, extract, >>>>> and filter >>>>> > > >>>>>>>>>>> every row for which >>>>> > > >>>>>>>>>>> > >>>>>>> that column is non-null. Thus, by providing a >>>>> > > >>>>>>>>>>> standardized way of tracking >>>>> > > >>>>>>>>>>> > >>>>>>> subcolumn metadata and data for variant >>>>> columns, >>>>> > > >>>>>>>>>>> Iceberg can make >>>>> > > >>>>>>>>>>> > >>>>>>> subcolumnar optimizations accessible across >>>>> various >>>>> > > >>>>>>>>>>> catalogs and query >>>>> > > >>>>>>>>>>> > >>>>>>> engines. 
>>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > Subcolumnarization is a non-trivial topic, >>>>> so we >>>>> > > >>>>>>>>>>> expect any >>>>> > > >>>>>>>>>>> > >>>>>>> concrete proposal to include not only the >>>>> set of >>>>> > > >>>>>>>>>>> changes to Iceberg >>>>> > > >>>>>>>>>>> > >>>>>>> metadata that allow compatible query engines >>>>> to >>>>> > > >>>>>>>>>>> interoperate on >>>>> > > >>>>>>>>>>> > >>>>>>> subcolumnarization data for variant columns, >>>>> but also >>>>> > > >>>>>>>>>>> reference >>>>> > > >>>>>>>>>>> > >>>>>>> documentation explaining subcolumnarization >>>>> principles >>>>> > > >>>>>>>>>>> and recommended best >>>>> > > >>>>>>>>>>> > >>>>>>> practices. >>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > It sounds like the recent Geo proposal [3] >>>>> may be a >>>>> > > >>>>>>>>>>> good starting >>>>> > > >>>>>>>>>>> > >>>>>>> point for how to approach this, so our plan >>>>> is to >>>>> > > >>>>>>>>>>> write something up in >>>>> > > >>>>>>>>>>> > >>>>>>> that vein that covers the proposed spec >>>>> changes, >>>>> > > >>>>>>>>>>> backwards compatibility, >>>>> > > >>>>>>>>>>> > >>>>>>> implementor burdens, etc. But we wanted to >>>>> first reach >>>>> > > >>>>>>>>>>> out to the community >>>>> > > >>>>>>>>>>> > >>>>>>> to introduce ourselves and the idea, and see >>>>> if >>>>> > > >>>>>>>>>>> there’s any early feedback >>>>> > > >>>>>>>>>>> > >>>>>>> we should incorporate before we spend too >>>>> much time on >>>>> > > >>>>>>>>>>> a concrete proposal. >>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>> > > >>>>>>>>>>> > >>>>>>> > Thank you! 
> [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
> [2] https://github.com/apache/spark/blob/master/common/variant/README.md
> [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>
> -Tyler, Nileema, Selcuk, Aihua

--
Ryan Blue
Databricks
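To make the subcolumnarization discussion in this thread concrete, here is a minimal sketch of the pruning idea Tyler describes: track per-file min/max statistics for selected nested fields of a variant column, so a scan can skip files without reading or parsing any variant data. All names and structures here are invented for illustration; the actual shredding spec is still to be adopted.

```python
# Illustrative sketch of subcolumnarization-style pruning (all names invented;
# this is not the Iceberg or Spark shredding spec). Each data file tracks
# min/max stats for a subset of nested variant fields, e.g. "event.ts";
# a scan can then skip files whose stats exclude the predicate range.

class DataFile:
    def __init__(self, path, subcolumn_stats):
        self.path = path
        # subcolumn_stats: {"dotted.field.path": (min_value, max_value), ...}
        self.subcolumn_stats = subcolumn_stats

def prune(files, field, lower, upper):
    """Keep only files whose [min, max] range for `field` overlaps [lower, upper].

    Files with no stats for the field cannot be pruned and must be scanned."""
    kept = []
    for f in files:
        stats = f.subcolumn_stats.get(field)
        if stats is None:
            kept.append(f)  # unknown range: must read the variant data
            continue
        lo, hi = stats
        if hi >= lower and lo <= upper:
            kept.append(f)  # ranges overlap: file may contain matches
    return kept

files = [
    DataFile("a.parquet", {"event.ts": (100, 199)}),
    DataFile("b.parquet", {"event.ts": (200, 299)}),
    DataFile("c.parquet", {}),  # not subcolumnarized: always scanned
]
# Predicate: event.ts BETWEEN 150 AND 180
survivors = [f.path for f in prune(files, "event.ts", 150, 180)]
```

Without such stats, every file's variant column has to be read, parsed, extracted, and filtered row by row, which is why the group felt shredding belongs in the V3 timeframe alongside the type itself.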