[jira] [Commented] (CALCITE-4918) Add a VARIANT data type

Mihai Budiu (Jira) Wed, 04 Sep 2024 10:22:47 -0700


    [ 
https://issues.apache.org/jira/browse/CALCITE-4918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879309#comment-17879309
 ]


Mihai Budiu commented on CALCITE-4918:
--------------------------------------

Why VARIANT?

Currently the Calcite functions for processing JSON represent JSON as a string, 
and then parse the string into JSON, and then provide operations on the JSON. 
This works fine for small JSON strings, but becomes untenable for documents 
that can span hundreds of megabytes, since each function call that involves the 
JSON will need to parse it again and again. The VARIANT data type can be used 
as a "binary" representation of JSON which is dynamically-typed. The expected 
workflow for processing JSON (and other related data types, such as XML, YAML, 
etc) is to ingest strings, then parse them into VARIANT values, and then access 
various fields efficiently in the SQL program.

The PR above provides a possible implementation of VARIANT. 

Here are some implementation details and answers to some questions received in 
the PR

- How VARIANT interacts with JDBC

This part I don't understand very well. I don't know how standard JDBC adapters 
could represent variant data types (which I don't think are standard). The 
current implementation works fine with Avatica, since it returns directly 
references to Java objects of the newly-introduced Variant class. This is why 
the tests can be executed. I will be happy to take suggestions on how to 
encode/decode VARIANT types through JDBC. 

- Whether there is an 'instanceof' operator

The current PR does not provide any function to retrieve the actual runtime 
type of a variant object. This PR only includes the primitive built-in 
operations on VARIANT: casts, field access, and indexing. A rich assortment of 
functions is expected to be provided by future contributions.

The plan is to provide a function similar to the Snowflake TYPEOF function: 
https://docs.snowflake.com/en/sql-reference/functions/typeof, but perhaps 
initially only for VARIANT objects. (The function could be extended to operate 
on all kinds of expressions, as in Snowflake, but that's an independent issue.) 
This function seems important, because it would enable writing CASE statements 
(CASE TYPEOF(variant) WHEN "VARCHAR" ...), so I may add it in a commit to this 
PR.

- Behavior in CAST (and implicit conversion)

Currently there are no implicit conversions to and from VARIANT. The Snowflake 
design does have a few 
https://docs.snowflake.com/en/sql-reference/data-type-conversion#data-types-that-can-be-cast.
 I am assuming that these will be added in later PRs.

The CAST is the main way now one can create (and destructure) VARIANT objects. 
Here is a quote from the documentation that is part of the PR:

{quote}
Values of `VARIANT` type can be created by casting any other value to a 
`VARIANT`: e.g. `SELECT CAST(x AS VARIANT)`.  Conversely, values of type 
`VARIANT` can be cast to any other data type `SELECT CAST(variant AS INT)`.  A 
cast of a value of type `VARIANT` to target type T will compare the runtime 
type with T.  If the types are identical, the original value is returned.  
Otherwise the `CAST` returns `NULL`.
{quote}

What is unspecified here is "equality of runtime types". The current 
implementation compares the types including scale and precision, but I am 
thinking that this is a mistake. I plan to revisit this design decision, and 
also document it as part of the current PR. E.g., the runtime type should be 
DECIMAL and not DECIMAL(10, 5).

- How variants are converted to and from strings

Any string-valued value can be converted to a VARIANT using a CAST. Conversely, 
CAST(variant AS VARCHAR) willl retrieve the original string value stored in a 
VARIANT (assuming it is a string value).

However, for most other data types casting to a string is really a form 
serialization, but this cannot be done for VARIANT, since we reserved the 
behavior of casts as described in the above paragraph, and I think this is the 
right behavior. So we need a way to de/serialize values into VARIANT objects.

One of the main uses for VARIANT is to represent JSON documents. For parsing 
and unparsing JSON from VARIANT some helper functions will need to be provided, 
e.g., PARSE_JSON(s VARCHAR) RETURNS VARIANT and UNPARSE_JSON(v VARIANT) RETURNS 
VARCHAR.

However, VARIANTs are much more powerful than JSON, and I expect there could be 
other serialization/deserialization methods, e.g., to/from XML, YAML, binary 
formats?, etc. There could be many versions of such de/serialization functions. 
For example, JSON does not specify how DATE values are represented. Depending 
on the target application, one could have deserialization methods for VARIANT 
that produce different encodings of DATE in the produced JSON.

How VARIANT values used as table/view/query columns are de/serialized is also 
yet unspecified, and it may be best to remain unspecified, since different 
databases may have different expectations. We could chose one particular 
implementation for the enumerable-based runtime, but it doesn't have to be the 
only one.


> Add a VARIANT data type
> -----------------------
>
>                 Key: CALCITE-4918
>                 URL: https://issues.apache.org/jira/browse/CALCITE-4918
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Julian Hyde
>            Assignee: Mihai Budiu
>            Priority: Major
>              Labels: pull-request-available
>
> Add a VARIANT data type. It would be a tagged union, and allow values of 
> multiple types to be stored, along with a type indicator so that the actual 
> type of each value can be discerned.
> VARIANT is not in the SQL standard. I know of three databases that have a 
> VARIANT type: MSSQL, Snowflake, Teradata. Calcite's VARIANT type will 
> probably draw from all three; however, the goal is not full compliance with 
> any of them, but useful functionality.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (CALCITE-4918) Add a VARIANT data type

Reply via email to