[PR] Add Variant data type in Spark. [spark]

via GitHub Tue, 07 Nov 2023 14:48:03 -0800


chenhao-db opened a new pull request, #43707:
URL: https://github.com/apache/spark/pull/43707


   ## What changes were proposed in this pull request?
   
   This PR adds the Variant data type to Spark. It does not introduce a binary 
encoding yet; a value simply carries two opaque binaries, `value` and `metadata`.
   
   This PR includes:
   - The in-memory Omni representation in different types of Spark rows. All 
rows except `UnsafeRow` use the `OmniVal` object to store an Omni value. In the 
`UnsafeRow`, the two binaries are stored contiguously. 
   - Spark Parquet writer and reader support for the Omni type. The support is 
agnostic to the binary encoding: it writes and reads the two binaries 
transparently, without interpreting them.
   - A dummy Spark `parse_json` implementation so that I can manually test the 
writer and reader. It currently returns an `OmniVal` with value being the raw 
bytes of the input string and empty metadata. This is **not** a valid Omni 
value in the final binary encoding.
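
The representation described in the list above can be modeled in a few lines. The following is a minimal Python sketch for illustration only, not the code in this PR: `OmniValSketch`, `parse_json_dummy`, `pack_contiguous`, and `unpack_contiguous` are hypothetical names, and the 4-byte length prefix is an assumption, since the description only says that `UnsafeRow` stores the two binaries contiguously.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class OmniValSketch:
    """Hypothetical stand-in for the value object: two opaque binaries."""
    value: bytes
    metadata: bytes


def parse_json_dummy(text: str) -> OmniValSketch:
    # Mirrors the placeholder behaviour described above: the value is the raw
    # bytes of the input string and the metadata is empty. No JSON is parsed.
    return OmniValSketch(value=text.encode("utf-8"), metadata=b"")


def pack_contiguous(v: OmniValSketch) -> bytes:
    # ASSUMED layout (not taken from the PR): a 4-byte little-endian length
    # of `value`, followed by `value`, followed by `metadata`.
    return struct.pack("<I", len(v.value)) + v.value + v.metadata


def unpack_contiguous(buf: bytes) -> OmniValSketch:
    # Read the assumed length prefix, then split the rest into the two binaries.
    (value_len,) = struct.unpack_from("<I", buf, 0)
    value = buf[4:4 + value_len]
    metadata = buf[4 + value_len:]
    return OmniValSketch(value=bytes(value), metadata=bytes(metadata))
```

Packing and then unpacking returns the original pair, which is the only property a reader or writer that is agnostic to the encoding needs: it moves the two binaries around without looking inside them.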
   
   ## How was this patch tested?
   
   Tested manually. Examples of supported usage:
   
   ```
   > sql("create table T using parquet as select parse_json('1') as o")
   > sql("select * from T").show
   +---+
   |  o|
   +---+
   |  1|
   +---+
   > sql("insert into T select parse_json('[2]') as o")
   > sql("select * from T").show
   +---+
   |  o|
   +---+
   |[2]|
   |  1|
   +---+
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

