+10000 for a JSON/BSON type. We also had the same discussion internally, and a JSON type would play really well with, for example, the SUPER type in Redshift: https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, and could also provide better integration with the Trino JSON type.
Looking forward to the proposal!

Best,
Jack Ye

On Wed, May 15, 2024 at 9:37 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:

> On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote:
>
>> > We may need some guidance on just how many we need to look at;
>> > we were planning on Spark and Trino, but weren't sure how much
>> > further down the rabbit hole we needed to go.
>>
>> There are some engines living outside the Java world. It would be
>> good if the proposal could cover the effort it takes to integrate
>> the variant type into them (e.g. Velox, DataFusion, etc.). This is
>> something that some proprietary Iceberg vendors also care about.
>
> Ack, makes sense. We can make sure to share some perspective on this.
>
>> > Not necessarily, no. As long as there's a binary type and Iceberg and
>> > the query engines are aware that the binary column needs to be
>> > interpreted as a variant, that should be sufficient.
>>
>> From the perspective of interoperability, it would be good to support
>> native types in the file specs. Life will be easier for projects like
>> Apache XTable. File formats could also provide finer-grained statistics
>> for the variant type, which facilitates data skipping.
>
> Agreed, there can definitely be additional value in native file format
> integration. Just wanted to highlight that it's not a strict requirement.
>
> -Tyler
>
>> Gang
>>
>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau
>> <tyler.aki...@snowflake.com.invalid> wrote:
>>
>>> Good to see you again as well, JB! Thanks!
>>>
>>> -Tyler
>>>
>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>>> wrote:
>>>
>>>> Hi Tyler,
>>>>
>>>> Super happy to see you there :) It reminds me of our discussions back
>>>> at the start of Apache Beam :)
>>>>
>>>> Anyway, the thread is pretty interesting. I remember some discussions
>>>> about a JSON datatype for spec v3. The binary data type is already
>>>> supported in spec v2.
>>>> I'm looking forward to the proposal and happy to help on this!
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau
>>>> <tyler.aki...@snowflake.com.invalid> wrote:
>>>> >
>>>> > Hello,
>>>> >
>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for
>>>> > which we’d like to get early feedback from the community. As you may
>>>> > know, Snowflake has embraced Iceberg as its open Data Lake format.
>>>> > Having made good progress on our own adoption of the Iceberg
>>>> > standard, we’re now in a position where there are features not yet
>>>> > supported in Iceberg which we think would be valuable for our users,
>>>> > and that we would like to discuss with and help contribute to the
>>>> > Iceberg community.
>>>> >
>>>> > The first two such features we’d like to discuss are in support of
>>>> > efficient querying of dynamically typed, semi-structured data:
>>>> > variant data types, and subcolumnarization of variant columns. In
>>>> > more detail, for anyone who may not already be familiar:
>>>> >
>>>> > 1. Variant data types
>>>> > Variant types allow for the efficient binary encoding of dynamic,
>>>> > semi-structured data such as JSON, Avro, etc. By encoding
>>>> > semi-structured data as a variant column, we retain the flexibility
>>>> > of the source data, while allowing query engines to more efficiently
>>>> > operate on the data. Snowflake has supported the variant data type
>>>> > on Snowflake tables for many years [1]. As more and more users
>>>> > utilize Iceberg tables in Snowflake, we’re hearing an increasing
>>>> > chorus of requests for variant support. Additionally, other query
>>>> > engines such as Apache Spark have begun adding variant support [2].
>>>> > As such, we believe it would be beneficial to the Iceberg community
>>>> > as a whole to standardize on the variant data type encoding used
>>>> > across Iceberg tables.
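[Editor's sketch: the "metadata plus value" idea behind variant encodings, described above, can be illustrated in a few lines. This is a toy model only, not the actual Spark encoding in [2] (which uses binary type tags and a different layout); every function name here is hypothetical. The point is just that field names are stored once in a dictionary ("metadata") and values refer to them by ID.]

```python
import json
import struct

def _walk_keys(obj):
    """Yield every object key appearing anywhere in a nested value."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield k
            yield from _walk_keys(v)
    elif isinstance(obj, list):
        for v in obj:
            yield from _walk_keys(v)

def _map_keys(obj, fn):
    """Rewrite every dict key in a nested value via fn."""
    if isinstance(obj, dict):
        return {fn(k): _map_keys(v, fn) for k, v in obj.items()}
    if isinstance(obj, list):
        return [_map_keys(v, fn) for v in obj]
    return obj

def encode_variant(obj):
    """Toy variant-style encoding: field-name dictionary + value buffer.
    Layout: [u32 metadata length][metadata][value]."""
    names = sorted(set(_walk_keys(obj)))
    name_id = {n: i for i, n in enumerate(names)}
    metadata = json.dumps(names).encode()
    value = json.dumps(_map_keys(obj, lambda k: name_id[k]),
                       separators=(",", ":")).encode()
    return struct.pack("<I", len(metadata)) + metadata + value

def decode_variant(buf):
    """Inverse of encode_variant: restore field names from the dictionary."""
    (mlen,) = struct.unpack_from("<I", buf)
    names = json.loads(buf[4:4 + mlen].decode())
    obj = json.loads(buf[4 + mlen:].decode())
    return _map_keys(obj, lambda k: names[int(k)])
```

Because repeated field names collapse to small integer IDs, engines can also resolve a path like `col:user.id` by dictionary lookup instead of string comparison per row, which is part of what makes variant columns cheaper to query than raw JSON text.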
>>>> >
>>>> > One specific point to make here is that, since an Apache OSS version
>>>> > of variant encoding already exists in Spark, it likely makes sense
>>>> > to simply adopt the Spark encoding as the Iceberg standard as well.
>>>> > The encoding we use internally today in Snowflake is slightly
>>>> > different, but essentially equivalent, and we see no particular
>>>> > value in trying to clutter the space with another
>>>> > equivalent-but-incompatible encoding.
>>>> >
>>>> > 2. Subcolumnarization
>>>> > Subcolumnarization of variant columns allows query engines to
>>>> > efficiently prune datasets when subcolumns (i.e., nested fields)
>>>> > within a variant column are queried, and also allows optionally
>>>> > materializing some of the nested fields as columns of their own,
>>>> > affording queries on these subcolumns the ability to read less data
>>>> > and spend less CPU on extraction. When subcolumnarizing, the system
>>>> > managing table metadata and data tracks individual pruning
>>>> > statistics (min, max, null, etc.) for some subset of the nested
>>>> > fields within a variant, and also manages any optional
>>>> > materialization. Without subcolumnarization, any query which touches
>>>> > a variant column must read, parse, extract, and filter every row for
>>>> > which that column is non-null. Thus, by providing a standardized way
>>>> > of tracking subcolumn metadata and data for variant columns, Iceberg
>>>> > can make subcolumnar optimizations accessible across various
>>>> > catalogs and query engines.
>>>> >
>>>> > Subcolumnarization is a non-trivial topic, so we expect any concrete
>>>> > proposal to include not only the set of changes to Iceberg metadata
>>>> > that allow compatible query engines to interoperate on
>>>> > subcolumnarization data for variant columns, but also reference
>>>> > documentation explaining subcolumnarization principles and
>>>> > recommended best practices.
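[Editor's sketch: to make the pruning benefit described above concrete, here is a minimal illustration of data skipping driven by per-file min/max statistics for one nested field inside a variant column. The file names, the `event.ts` path, and the stats layout are all invented for illustration; real Iceberg statistics live in manifest metadata, not a Python dict.]

```python
# Hypothetical per-file (min, max) stats that a catalog might track
# for the nested field "event.ts" after subcolumnarizing a variant column.
file_stats = {
    "data-001.parquet": {"event.ts": (100, 200)},
    "data-002.parquet": {"event.ts": (250, 400)},
    "data-003.parquet": {},  # no stats tracked for this field: must be read
}

def files_to_scan(stats, path, lo, hi):
    """Keep a file unless its tracked (min, max) range for `path`
    provably misses the query range [lo, hi]. Files without stats
    for the path can never be skipped."""
    keep = []
    for fname, cols in stats.items():
        rng = cols.get(path)
        if rng is None or not (rng[1] < lo or rng[0] > hi):
            keep.append(fname)
    return keep
```

A query like `WHERE col:event.ts BETWEEN 300 AND 500` would skip `data-001.parquet` outright, which is exactly the work a non-subcolumnarized variant column forces the engine to do row by row.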
>>>> >
>>>> > It sounds like the recent Geo proposal [3] may be a good starting
>>>> > point for how to approach this, so our plan is to write something up
>>>> > in that vein that covers the proposed spec changes, backwards
>>>> > compatibility, implementor burdens, etc. But we wanted to first
>>>> > reach out to the community to introduce ourselves and the idea, and
>>>> > see if there’s any early feedback we should incorporate before we
>>>> > spend too much time on a concrete proposal.
>>>> >
>>>> > Thank you!
>>>> >
>>>> > [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>>> > [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>>>> > [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>> >
>>>> > -Tyler, Nileema, Selcuk, Aihua