Hi Timo,

Thanks for the comments!

1. We want the Variant interface to live in the core API, so we will
not use TableRuntimeException. Instead, a dedicated exception type for
Variant has been added. Please see the FLIP.

2. I agree that `isScalar()` and `getBytes()` seem more appropriate.

3. This is a good question. After some investigation, I think the
generic parameter on VariantBuilder is unnecessary.

4. That sounds good to me. I updated the names of the methods.
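To make 2. and 4. concrete, here is a rough sketch of what the renamed
accessors could look like (only an illustration of the naming; the
authoritative interface definition is in the FLIP):

    public interface Variant {
        // Type is the enum of variant value types defined in the FLIP.
        Type getType();
        boolean isScalar();              // formerly isValue()
        byte[] getBytes();               // formerly getBinary()
        Variant getField(String field);  // field access on an object variant
        Variant getElement(int pos);     // element access on an array variant
        <T> T getAs();                   // typed accessor, in the spirit of Row#getFieldAs
    }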

5. It means a timestamp without a time zone, following the Parquet
variant specification[1], in which the TIMESTAMP type carries time zone
information. It is indeed confusing, so I updated the type names to
stick with the conventions of the Flink type system.

6. The missing types have been added. And I think we should model the
same binary layout within the Flink code base, without introducing a
dependency on Parquet.

7. The package and module information has been added.

8. I agree that we can use `JSON_STRING` instead of `TO_JSON`. As for
reusing the `JSON` function to replace `PARSE_JSON`, I cannot see how
it could keep working with the current JSON_OBJECT and JSON_ARRAY
functions if it returns a Variant. We may also want a `PARSE_XML`
function in the future that converts an XML string to a Variant.
Therefore, I am leaning toward introducing the new `PARSE_JSON`
function (rough usage sketch below). What do you think?
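As a sketch of the intended usage (assuming `PARSE_JSON` returns a
VARIANT and `JSON_STRING` is extended to accept one; none of this is
final):

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    TableEnvironment tEnv =
        TableEnvironment.create(EnvironmentSettings.inStreamingMode());

    // STRING -> VARIANT via the proposed PARSE_JSON function.
    tEnv.executeSql("SELECT PARSE_JSON('{\"a\": 1, \"b\": [2, 3]}') AS v");

    // VARIANT -> JSON string via the existing JSON_STRING function,
    // which would take the place of a dedicated TO_JSON.
    tEnv.executeSql("SELECT JSON_STRING(PARSE_JSON('{\"a\": 1}')) AS js");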

9. The SQL/JSON item methods in the SQL standard look good. However, I
think they should be implemented in Calcite rather than implemented
separately in Flink. The syntax proposed in the FLIP is actually
Calcite's current implementation of the item method for the Variant
type[2], which will only take a trivial effort to support once we bump
the Calcite version.

10. After some investigation, I found that backporting only the
required classes is not trivial: many classes would have to be copied
into the Flink repo. Therefore, instead of going for a non-trivial
temporary solution that will be removed later, we are better off
bumping the Calcite version, which is inevitable anyway.

11. Yes, the string representation of a variant is a JSON string. We
will put some limitations on casting from/to the variant type. In
short, we only allow casting from/to scalar variant values; a type
error throws an exception when `CAST` is used and returns null when
`TRY_CAST` is used. To answer your question, CAST(variant AS ROW<i INT>)
is not valid, while CAST(variant AS INT) is valid and may throw a
RuntimeException if the variant value cannot be cast to an integer. I
updated the FLIP to include this information.
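To illustrate those rules (a sketch of the intended semantics only,
reusing the hypothetical PARSE_JSON and the table environment from the
sketch in 8; the exact exception type is described in the FLIP):

    // Scalar variant holding a number -> INT: valid, returns 1.
    tEnv.executeSql("SELECT CAST(PARSE_JSON('1') AS INT)");

    // Scalar variant holding a string that is not an integer:
    // CAST throws at runtime, TRY_CAST returns NULL instead.
    tEnv.executeSql("SELECT CAST(PARSE_JSON('\"abc\"') AS INT)");
    tEnv.executeSql("SELECT TRY_CAST(PARSE_JSON('\"abc\"') AS INT)");

    // Non-scalar targets are rejected, e.g. CAST(v AS ROW<i INT>) is not valid.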


[1] https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
[2] https://calcite.apache.org/docs/reference.html#the-variant-type



On Fri, Apr 25, 2025 at 10:10 PM Timo Walther <twal...@apache.org> wrote:
>
> Hi Xuannan,
>
> sorry for the delay. This is a great addition but needs careful design.
> Here is some feedback that I found while reading the design doc:
>
> 1. Exception design
>
> The FLIP mentions IllegalStateException to be thrown. However, this is a
> very generic exception. We should rather use TableRuntimeException to
> indicate that this is a user's mistake. Or maybe even a dedicated new
> runtime exception type for variants.
>
> 2. Variant.isValue()
>
> The JavaDoc already indicates that a `isScalar()` might be more
> appropriate naming.
>
> 2. Variant.getBinary()
>
> Shall we call this getBytes()? This rather fits to the data type BYTES.
> We are also not calling `getString()` as `getVarchar()`.
>
> 3. VariantBuilder<T>
>
> Could you give some explanation why the VariantBuilder uses a generic?
> Maybe give an example usage in a UDF?
>
> 4. Variant.get() vs. Variant.getField
>
> We should be more explicit in the naming. How about we use
> `Variant.getField(String)` and `Variant.getElement(int)` instead? This
> would also allow us to introduce a `Object Variant.get()` that returns a
> value purely on the `getType` and a `<T> T getAs()` consistent with the
> method `getFieldAs` in the Row class.
>
> 5. Type.TIMESTAMP_NTZ
>
> Do you mean TIMESTAMP_LTZ here?
>
> 6. Implementation of Variant that follow the same layout as the Parquet
> Variant Encoding
>
> Parquet Variant Encoding supports int8, int16, int32. Shouldn't we also
> then support those types in `Variant` class like Short, Byte etc.? And
> especially I noticed that Integer is not supported? We should aim to be
> close to the existing type system.
>
> Also, will we add Parque dependencies or implement this within the Flink
> code base for modelling the same binary layout?
>
> 7. Packages / modules
>
> Please specify where you want to add the classes. Which package/module?
>
> 8. Built-in functions
>
> We recently introduced a `JSON()` function but reserved the use to
> `JSON_OBJECT` and `JSON_ARRAY` only. With variant type support, we can
> allow using this function at any location and return a VARIANT type. I
> guess it maps to `TRY_PARSE_JSON`?
>
> A TO_JSON function is not required as `JSON_STRING` already supports
> arbitrary types and just needs to be extended.
>
> 9. Extract values from a variant
>
> The SQL standard defined SQL/JSON item methods (T865–T878) [1]. It would
> make SQL more readable to support a method per data type.
> E.g. `SELECT integer(variantCol), timestamp(variantCol)`
>
> 10. Bumping Calcite
>
> The FLIP states that you want to bump the version of Apache Calcite to
> 1.39.0. Are you planning to just backport required classes or really
> upgrade to this version? If this is a full upgrade, we should do this in
> stages as history has shown that bumping Calcite is not a
> straight-forward task. But in any case, this would be great to do as we
> could benefit from other features such as lambda support.
>
> 11. String representation and casting rules
>
> Could you elaborate in the FLIP how CAST(variant AS ROW<i INT>) or
> CAST(variant AS INT) behave? Also could you elaborate how the
> `env.executeSql().print()` representation looks like? Will it be JSON?
> Also, Row.toString should support Variant type.
>
> Cheers,
> Timo
>
>
> [1]
> https://peter.eisentraut.org/blog/2023/04/04/sql-2023-is-finished-here-is-whats-new
>
>
> On 25.04.25 13:46, Xuannan Su wrote:
> > Hi everyone,
> >
> > Thank you for all the comments! If there are no further comments, I'd
> > like to close the discussion and start the voting next Monday.
> >
> > Best,
> > Xuannan
> >
> > On Fri, Apr 25, 2025 at 7:41 PM Lincoln Lee <lincoln.8...@gmail.com> wrote:
> >>
> >> +1 for this FLIP. VARIANT type support will be a great addition to sql.
> >> Look forward to the detailed design of the subsequent shredding
> >> optimizations.
> >>
> >>
> >> Best,
> >> Lincoln Lee
> >>
> >>
> >> Timo Walther <twal...@apache.org> 于2025年4月22日周二 16:51写道:
> >>
> >>> +1 for this feature. Having a VARIANT type makes a lot of sense and
> >>> together with an OBJECT type will make semi-structured data processing
> >>> in Flink easier.
> >>>
> >>> Currently, I'm catching up with notifications after the easter holidays,
> >>> but happy to give some feedback by tomorrow or Thursday as well.
> >>>
> >>> Thanks,
> >>> Timo
> >>>
> >>> On 22.04.25 10:40, Jingsong Li wrote:
> >>>> Thanks Xuannan for driving this discussion.
> >>>>
> >>>> At present, communities such as Apache Iceberg, Delta, Spark, Parquet,
> >>>> etc. are all designing and developing around Variant, and our Flink
> >>>> support for Variant is very valuable.
> >>>>
> >>>> After a rough look at the design, there is no overall problem. It is
> >>>> designed around Parquet's Variant standard, which is similar to the
> >>>> overall design of Spark SQL.
> >>>>
> >>>> +1 for this.
> >>>>
> >>>> Best,
> >>>> Jingsong
> >>>>
> >>>> On Mon, Apr 14, 2025 at 6:12 PM Xuannan Su <suxuanna...@gmail.com>
> >>> wrote:
> >>>>>
> >>>>> Hi devs,
> >>>>>
> >>>>> I’d like to start a discussion around FLIP-521: Integrating Variant
> >>>>> Type into Flink: Enabling Efficient Semi-Structured Data
> >>>>> Processing[1]. Working with semi-structured data has long been a
> >>>>> foundational scenario of the Lakehouse. While JSON has traditionally
> >>>>> served as the primary storage format for such data, its implementation
> >>>>> as serialized strings introduces significant inefficiencies.
> >>>>>
> >>>>> In this FLIP, we integrate the Variant encoding, which is a compact
> >>>>> binary representation of semi-structured data[2], to improve the
> >>>>> performance of processing semi-structured data. As Paimon has
> >>>>> supported the Variant type recently[3], this FLIP would allow Flink to
> >>>>> further leverage Paimon's storage-layer optimizations, improving
> >>>>> performance and resource utilization for semi-structured data
> >>>>> pipelines.
> >>>>>
> >>>>> Best,
> >>>>> Xuannan
> >>>>>
> >>>>> [1]
> >>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-521%3A+Integrating+Variant+Type+into+Flink%3A+Enabling+Efficient+Semi-Structured+Data+Processing
> >>>>> [2]
> >>> https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
> >>>>> [3] https://github.com/apache/paimon/issues/4471
> >>>>
> >>>
> >>>
> >
>
