[
https://issues.apache.org/jira/browse/CALCITE-4918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879309#comment-17879309
]
Mihai Budiu commented on CALCITE-4918:
--------------------------------------
Why VARIANT?
Currently the Calcite functions for processing JSON represent JSON as a string,
and then parse the string into JSON, and then provide operations on the JSON.
This works fine for small JSON strings, but becomes untenable for documents
that can span hundreds of megabytes, since each function call that involves the
JSON will need to parse it again and again. The VARIANT data type can be used
as a "binary" representation of JSON which is dynamically-typed. The expected
workflow for processing JSON (and other related data types, such as XML, YAML,
etc) is to ingest strings, then parse them into VARIANT values, and then access
various fields efficiently in the SQL program.
The PR above provides a possible implementation of VARIANT.
Here are some implementation details and answers to some questions received in
the PR
- How VARIANT interacts with JDBC
This part I don't understand very well. I don't know how standard JDBC adapters
could represent variant data types (which I don't think are standard). The
current implementation works fine with Avatica, since it returns directly
references to Java objects of the newly-introduced Variant class. This is why
the tests can be executed. I will be happy to take suggestions on how to
encode/decode VARIANT types through JDBC.
- Whether there is an 'instanceof' operator
The current PR does not provide any function to retrieve the actual runtime
type of a variant object. This PR only includes the primitive built-in
operations on VARIANT: casts, field access, and indexing. A rich assortment of
functions is expected to be provided by future contributions.
The plan is to provide a function similar to the Snowflake TYPEOF function:
https://docs.snowflake.com/en/sql-reference/functions/typeof, but perhaps
initially only for VARIANT objects. (The function could be extended to operate
on all kinds of expressions, as in Snowflake, but that's an independent issue.)
This function seems important, because it would enable writing CASE statements
(CASE TYPEOF(variant) WHEN "VARCHAR" ...), so I may add it in a commit to this
PR.
- Behavior in CAST (and implicit conversion)
Currently there are no implicit conversions to and from VARIANT. The Snowflake
design does have a few
https://docs.snowflake.com/en/sql-reference/data-type-conversion#data-types-that-can-be-cast.
I am assuming that these will be added in later PRs.
The CAST is the main way now one can create (and destructure) VARIANT objects.
Here is a quote from the documentation that is part of the PR:
{quote}
Values of `VARIANT` type can be created by casting any other value to a
`VARIANT`: e.g. `SELECT CAST(x AS VARIANT)`. Conversely, values of type
`VARIANT` can be cast to any other data type `SELECT CAST(variant AS INT)`. A
cast of a value of type `VARIANT` to target type T will compare the runtime
type with T. If the types are identical, the original value is returned.
Otherwise the `CAST` returns `NULL`.
{quote}
What is unspecified here is "equality of runtime types". The current
implementation compares the types including scale and precision, but I am
thinking that this is a mistake. I plan to revisit this design decision, and
also document it as part of the current PR. E.g., the runtime type should be
DECIMAL and not DECIMAL(10, 5).
- How variants are converted to and from strings
Any string-valued value can be converted to a VARIANT using a CAST. Conversely,
CAST(variant AS VARCHAR) willl retrieve the original string value stored in a
VARIANT (assuming it is a string value).
However, for most other data types casting to a string is really a form
serialization, but this cannot be done for VARIANT, since we reserved the
behavior of casts as described in the above paragraph, and I think this is the
right behavior. So we need a way to de/serialize values into VARIANT objects.
One of the main uses for VARIANT is to represent JSON documents. For parsing
and unparsing JSON from VARIANT some helper functions will need to be provided,
e.g., PARSE_JSON(s VARCHAR) RETURNS VARIANT and UNPARSE_JSON(v VARIANT) RETURNS
VARCHAR.
However, VARIANTs are much more powerful than JSON, and I expect there could be
other serialization/deserialization methods, e.g., to/from XML, YAML, binary
formats?, etc. There could be many versions of such de/serialization functions.
For example, JSON does not specify how DATE values are represented. Depending
on the target application, one could have deserialization methods for VARIANT
that produce different encodings of DATE in the produced JSON.
How VARIANT values used as table/view/query columns are de/serialized is also
yet unspecified, and it may be best to remain unspecified, since different
databases may have different expectations. We could chose one particular
implementation for the enumerable-based runtime, but it doesn't have to be the
only one.
> Add a VARIANT data type
> -----------------------
>
> Key: CALCITE-4918
> URL: https://issues.apache.org/jira/browse/CALCITE-4918
> Project: Calcite
> Issue Type: Bug
> Reporter: Julian Hyde
> Assignee: Mihai Budiu
> Priority: Major
> Labels: pull-request-available
>
> Add a VARIANT data type. It would be a tagged union, and allow values of
> multiple types to be stored, along with a type indicator so that the actual
> type of each value can be discerned.
> VARIANT is not in the SQL standard. I know of three databases that have a
> VARIANT type: MSSQL, Snowflake, Teradata. Calcite's VARIANT type will
> probably draw from all three; however, the goal is not full compliance with
> any of them, but useful functionality.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)