Burak Yavuz created SPARK-30334:
-----------------------------------
Summary: Add metadata around semi-structured columns to Spark
Key: SPARK-30334
URL: https://issues.apache.org/jira/browse/SPARK-30334
Project: Spark
Issue Type: New Feature
Components: SQL
Affects Versions: 2.4.4
Reporter: Burak Yavuz
Semi-structured data is widely used in the data industry to report events
in a variety of formats. Click events in product analytics can be stored
as JSON. Some application logs take the form of delimited key=value text.
Other data may be in XML.
The goal of this project is to be able to signal to Spark that such a column
exists. This would then enable Spark to "auto-parse" these columns on the fly.
The proposal is to store this information as part of the column metadata, in
the fields:
- format: The format of the semi-structured column, e.g. json, xml, avro
- options: Options for parsing these columns
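As a rough illustration of the two fields above, the metadata for a column could look like the following plain-Python sketch. Only the keys "format" and "options" come from the proposal; the helper `semi_structured_metadata`, the format name "kv", and the option names `pairDelimiter` and `keyValueDelimiter` are assumptions made up for this example. Spark serializes column metadata as JSON along with the schema, so the sketch checks that the structure round-trips through JSON.

```python
import json


def semi_structured_metadata(fmt, options=None):
    """Build the proposed metadata for a semi-structured column.

    Hypothetical helper: only the "format" and "options" keys are
    taken from the proposal.
    """
    return {"format": fmt, "options": dict(options or {})}


# A JSON column needs no extra options in this sketch.
json_meta = semi_structured_metadata("json")

# Assumed format name and option names for delimited key=value text.
kv_meta = semi_structured_metadata(
    "kv", {"pairDelimiter": "|", "keyValueDelimiter": "="}
)

# The metadata has to survive JSON serialization to be stored with a schema.
serialized = json.dumps(kv_meta, sort_keys=True)
print(serialized)
```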
Then imagine having the following data:
{code:java}
+------------+-------+--------------------+
| ts         | event | raw                |
+------------+-------+--------------------+
| 2019-10-12 | click | {"field":"value"}  |
+------------+-------+--------------------+
{code}
With the column metadata in place, the query
SELECT raw.field FROM data
will return "value".
Similarly, with the following data:
{code:java}
+------------+-------+----------------------+
| ts         | event | raw                  |
+------------+-------+----------------------+
| 2019-10-12 | click | field1=v1|field2=v2  |
+------------+-------+----------------------+
{code}
the query
SELECT raw.field1 FROM data
will return "v1".
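The intended behaviour of the two queries can be sketched in plain Python, outside of Spark. This is only a model of the "auto-parse" idea under stated assumptions, not how Spark would implement it: the functions `resolve_field` and `parse_key_value`, the format name "kv", and the option names `pairDelimiter` and `keyValueDelimiter` are all hypothetical.

```python
import json


def parse_key_value(raw, pair_delimiter="|", key_value_delimiter="="):
    """Parse delimited key=value text into a dict, skipping malformed pairs."""
    parsed = {}
    for pair in raw.split(pair_delimiter):
        key, sep, value = pair.partition(key_value_delimiter)
        if sep and key:
            parsed[key.strip()] = value.strip()
    return parsed


def resolve_field(raw, metadata, field):
    """Model `SELECT raw.<field>`: parse `raw` as its metadata describes."""
    fmt = metadata["format"]
    options = metadata.get("options", {})
    if fmt == "json":
        record = json.loads(raw)
    elif fmt == "kv":  # hypothetical format name
        record = parse_key_value(
            raw,
            options.get("pairDelimiter", "|"),
            options.get("keyValueDelimiter", "="),
        )
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return record.get(field)  # None when the field is absent


# The two rows from the examples above.
print(resolve_field('{"field":"value"}', {"format": "json", "options": {}}, "field"))
print(resolve_field("field1=v1|field2=v2", {"format": "kv", "options": {}}, "field1"))
```

Dispatching on the "format" key keeps the parsing rules with the column rather than with each query, which is the point of storing them in metadata.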