+10000 for a JSON/BSON type. We also had the same discussion internally, and a JSON type would play really well with, for example, the SUPER type in Redshift: https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, and could also provide better integration with the Trino JSON type.
Looking forward to the proposal!

Best,
Jack Ye

On Wed, May 15, 2024 at 9:37 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:

> On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote:
>
>> > We may need some guidance on just how many we need to look at;
>> > we were planning on Spark and Trino, but weren't sure how much
>> > further down the rabbit hole we needed to go.
>>
>> There are some engines living outside the Java world. It would be
>> good if the proposal could cover the effort it takes to integrate
>> the variant type into them (e.g. Velox, DataFusion, etc.). This is
>> something that some proprietary Iceberg vendors also care about.
>
> Ack, makes sense. We can make sure to share some perspective on this.
>
>> > Not necessarily, no. As long as there's a binary type and Iceberg and
>> > the query engines are aware that the binary column needs to be
>> > interpreted as a variant, that should be sufficient.
>>
>> From the perspective of interoperability, it would be good to support
>> native types in the file specs. Life will be easier for projects like
>> Apache XTable. File formats could also provide finer-grained statistics
>> for the variant type, which facilitates data skipping.
>
> Agreed, there can definitely be additional value in native file format
> integration. Just wanted to highlight that it's not a strict requirement.
>
> -Tyler
>
>> Gang
>>
>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau
>> <tyler.aki...@snowflake.com.invalid> wrote:
>>
>>> Good to see you again as well, JB! Thanks!
>>>
>>> -Tyler
>>>
>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>>> wrote:
>>>
>>>> Hi Tyler,
>>>>
>>>> Super happy to see you there :) It reminds me of our discussions back
>>>> at the start of Apache Beam :)
>>>>
>>>> Anyway, the thread is pretty interesting. I remember some discussions
>>>> about a JSON datatype for spec v3. The binary data type is already
>>>> supported in spec v2.
>>>> I'm looking forward to the proposal and happy to help on this!
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau
>>>> <tyler.aki...@snowflake.com.invalid> wrote:
>>>> >
>>>> > Hello,
>>>> >
>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for
>>>> > which we’d like to get early feedback from the community. As you may
>>>> > know, Snowflake has embraced Iceberg as its open Data Lake format.
>>>> > Having made good progress on our own adoption of the Iceberg
>>>> > standard, we’re now in a position where there are features not yet
>>>> > supported in Iceberg which we think would be valuable for our users,
>>>> > and that we would like to discuss with and help contribute to the
>>>> > Iceberg community.
>>>> >
>>>> > The first two such features we’d like to discuss are in support of
>>>> > efficient querying of dynamically typed, semi-structured data:
>>>> > variant data types, and subcolumnarization of variant columns. In
>>>> > more detail, for anyone who may not already be familiar:
>>>> >
>>>> > 1. Variant data types
>>>> > Variant types allow for the efficient binary encoding of dynamic,
>>>> > semi-structured data such as JSON, Avro, etc. By encoding
>>>> > semi-structured data as a variant column, we retain the flexibility
>>>> > of the source data, while allowing query engines to more efficiently
>>>> > operate on the data. Snowflake has supported the variant data type
>>>> > on Snowflake tables for many years [1]. As more and more users
>>>> > utilize Iceberg tables in Snowflake, we’re hearing an increasing
>>>> > chorus of requests for variant support. Additionally, other query
>>>> > engines such as Apache Spark have begun adding variant support [2].
>>>> > As such, we believe it would be beneficial to the Iceberg community
>>>> > as a whole to standardize on the variant data type encoding used
>>>> > across Iceberg tables.
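[Editor's sketch: the "metadata plus value" idea behind variant encodings, described above, can be illustrated in a few lines. This is a toy model only, not the actual Spark encoding in [2] (which uses binary type tags and a different layout); every function name here is hypothetical. The point is just that field names are stored once in a dictionary ("metadata") and values refer to them by ID.]

```python
import json
import struct

def _walk_keys(obj):
    """Yield every object key appearing anywhere in a nested value."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield k
            yield from _walk_keys(v)
    elif isinstance(obj, list):
        for v in obj:
            yield from _walk_keys(v)

def _map_keys(obj, fn):
    """Rewrite every dict key in a nested value via fn."""
    if isinstance(obj, dict):
        return {fn(k): _map_keys(v, fn) for k, v in obj.items()}
    if isinstance(obj, list):
        return [_map_keys(v, fn) for v in obj]
    return obj

def encode_variant(obj):
    """Toy variant-style encoding: field-name dictionary + value buffer.
    Layout: [u32 metadata length][metadata][value]."""
    names = sorted(set(_walk_keys(obj)))
    name_id = {n: i for i, n in enumerate(names)}
    metadata = json.dumps(names).encode()
    value = json.dumps(_map_keys(obj, lambda k: name_id[k]),
                       separators=(",", ":")).encode()
    return struct.pack("<I", len(metadata)) + metadata + value

def decode_variant(buf):
    """Inverse of encode_variant: restore field names from the dictionary."""
    (mlen,) = struct.unpack_from("<I", buf)
    names = json.loads(buf[4:4 + mlen].decode())
    obj = json.loads(buf[4 + mlen:].decode())
    return _map_keys(obj, lambda k: names[int(k)])
```

Because repeated field names collapse to small integer IDs, engines can also resolve a path like `col:user.id` by dictionary lookup instead of string comparison per row, which is part of what makes variant columns cheaper to query than raw JSON text.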
>>>> >
>>>> > One specific point to make here is that, since an Apache OSS version
>>>> > of variant encoding already exists in Spark, it likely makes sense
>>>> > to simply adopt the Spark encoding as the Iceberg standard as well.
>>>> > The encoding we use internally today in Snowflake is slightly
>>>> > different, but essentially equivalent, and we see no particular
>>>> > value in trying to clutter the space with another
>>>> > equivalent-but-incompatible encoding.
>>>> >
>>>> > 2. Subcolumnarization
>>>> > Subcolumnarization of variant columns allows query engines to
>>>> > efficiently prune datasets when subcolumns (i.e., nested fields)
>>>> > within a variant column are queried, and also allows optionally
>>>> > materializing some of the nested fields as columns of their own,
>>>> > affording queries on these subcolumns the ability to read less data
>>>> > and spend less CPU on extraction. When subcolumnarizing, the system
>>>> > managing table metadata and data tracks individual pruning
>>>> > statistics (min, max, null, etc.) for some subset of the nested
>>>> > fields within a variant, and also manages any optional
>>>> > materialization. Without subcolumnarization, any query which touches
>>>> > a variant column must read, parse, extract, and filter every row for
>>>> > which that column is non-null. Thus, by providing a standardized way
>>>> > of tracking subcolumn metadata and data for variant columns, Iceberg
>>>> > can make subcolumnar optimizations accessible across various
>>>> > catalogs and query engines.
>>>> >
>>>> > Subcolumnarization is a non-trivial topic, so we expect any concrete
>>>> > proposal to include not only the set of changes to Iceberg metadata
>>>> > that allow compatible query engines to interoperate on
>>>> > subcolumnarization data for variant columns, but also reference
>>>> > documentation explaining subcolumnarization principles and
>>>> > recommended best practices.
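[Editor's sketch: to make the pruning benefit described above concrete, here is a minimal illustration of data skipping driven by per-file min/max statistics for one nested field inside a variant column. The file names, the `event.ts` path, and the stats layout are all invented for illustration; real Iceberg statistics live in manifest metadata, not a Python dict.]

```python
# Hypothetical per-file (min, max) stats that a catalog might track
# for the nested field "event.ts" after subcolumnarizing a variant column.
file_stats = {
    "data-001.parquet": {"event.ts": (100, 200)},
    "data-002.parquet": {"event.ts": (250, 400)},
    "data-003.parquet": {},  # no stats tracked for this field: must be read
}

def files_to_scan(stats, path, lo, hi):
    """Keep a file unless its tracked (min, max) range for `path`
    provably misses the query range [lo, hi]. Files without stats
    for the path can never be skipped."""
    keep = []
    for fname, cols in stats.items():
        rng = cols.get(path)
        if rng is None or not (rng[1] < lo or rng[0] > hi):
            keep.append(fname)
    return keep
```

A query like `WHERE col:event.ts BETWEEN 300 AND 500` would skip `data-001.parquet` outright, which is exactly the work a non-subcolumnarized variant column forces the engine to do row by row.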
>>>> >
>>>> > It sounds like the recent Geo proposal [3] may be a good starting
>>>> > point for how to approach this, so our plan is to write something up
>>>> > in that vein that covers the proposed spec changes, backwards
>>>> > compatibility, implementor burdens, etc. But we wanted to first
>>>> > reach out to the community to introduce ourselves and the idea, and
>>>> > see if there’s any early feedback we should incorporate before we
>>>> > spend too much time on a concrete proposal.
>>>> >
>>>> > Thank you!
>>>> >
>>>> > [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>>> > [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>>>> > [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>> >
>>>> > -Tyler, Nileema, Selcuk, Aihua