chenhao-db opened a new pull request, #43707:
URL: https://github.com/apache/spark/pull/43707
## What changes were proposed in this pull request?
This PR adds the Variant data type to Spark. It does not yet introduce a
binary encoding; a value simply consists of two binaries, `value` and `metadata`.
This PR includes:
- The in-memory Omni representation in the different kinds of Spark rows. All
rows except `UnsafeRow` use an `OmniVal` object to store an Omni value. In
`UnsafeRow`, the two binaries are stored contiguously.
- Spark Parquet writer and reader support for the Omni type. This support is
agnostic to the detailed binary encoding: it transparently writes and reads the
two binaries.
- A dummy Spark `parse_json` implementation so that I can manually test the
writer and reader. It currently returns an `OmniVal` with value being the raw
bytes of the input string and empty metadata. This is **not** a valid Omni
value in the final binary encoding.
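For illustration only, here is a minimal, Spark-free sketch of the ideas in the list above: a value made of two binaries, a placeholder parser that keeps the raw input bytes with empty metadata, and one way to store both binaries contiguously. `TwoBinaryValue`, `OmniSketch`, `dummyParseJson`, `pack`, and `unpack` are hypothetical names, and the length-prefixed layout is an assumption made for this sketch, not necessarily the layout this PR uses in `UnsafeRow`.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Hypothetical stand-in for a value made of two binaries; not Spark's class.
final case class TwoBinaryValue(value: Array[Byte], metadata: Array[Byte])

object OmniSketch {
  // Mirrors the placeholder behaviour described above:
  // the value is the raw bytes of the input string, the metadata is empty.
  def dummyParseJson(json: String): TwoBinaryValue =
    TwoBinaryValue(json.getBytes(StandardCharsets.UTF_8), Array.emptyByteArray)

  // Assumed contiguous layout: [4-byte value length][value bytes][metadata bytes].
  def pack(v: TwoBinaryValue): Array[Byte] = {
    val buf = ByteBuffer.allocate(4 + v.value.length + v.metadata.length)
    buf.putInt(v.value.length).put(v.value).put(v.metadata)
    buf.array()
  }

  // Splits a packed buffer back into its two binaries.
  def unpack(bytes: Array[Byte]): TwoBinaryValue = {
    val buf = ByteBuffer.wrap(bytes)
    val value = new Array[Byte](buf.getInt())
    buf.get(value)
    val metadata = new Array[Byte](buf.remaining())
    buf.get(metadata)
    TwoBinaryValue(value, metadata)
  }
}
```

Because neither `pack` nor `unpack` inspects the contents of the two byte arrays, the same code would keep working once a real binary encoding is defined, which is the sense in which storage can stay agnostic to the encoding.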
## How was this patch tested?
Manual testing. Some supported usages:
```
> sql("create table T using parquet as select parse_json('1') as o")
> sql("select * from T").show
+---+
| o|
+---+
| 1|
+---+
> sql("insert into T select parse_json('[2]') as o")
> sql("select * from T").show
+---+
| o|
+---+
|[2]|
| 1|
+---+
```