[
https://issues.apache.org/jira/browse/SPARK-30334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-30334:
----------------------------------
Target Version/s: 3.1.0 (was: 3.0.0)
> Add metadata around semi-structured columns to Spark
> ----------------------------------------------------
>
> Key: SPARK-30334
> URL: https://issues.apache.org/jira/browse/SPARK-30334
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.4.4
> Reporter: Burak Yavuz
> Priority: Major
>
> Semi-structured data is widely used in the data industry to report events
> in a variety of formats. Click events in product analytics can be stored
> as JSON. Some application logs take the form of delimited key=value
> text. Some data may be in XML.
> The goal of this project is to let users signal to Spark that such a column
> exists, so that Spark can "auto-parse" these columns on the fly.
> The proposal is to store this information as part of the column metadata, in
> the fields:
> - format: The format of the semi-structured column, e.g. JSON, XML, Avro
> - options: Options for parsing these columns
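The two proposed metadata fields could be pictured as a small map attached to a column definition. The following is only a sketch in plain Python, not Spark code: the `Column` class, the `semi_structured` helper, the format name "delimited", and the option names "delimiter" and "separator" are all assumptions made for illustration and do not appear in the proposal.

```python
import json
from dataclasses import dataclass, field


# Hypothetical stand-in for a column definition; this is not a Spark class.
@dataclass
class Column:
    name: str
    data_type: str
    metadata: dict = field(default_factory=dict)


def semi_structured(name, fmt, options=None):
    """Build a string column tagged with the proposed 'format' and 'options' fields."""
    return Column(name, "string", {"format": fmt, "options": dict(options or {})})


# A JSON column needs no parsing options in this sketch.
raw_json = semi_structured("raw", "json")

# Assumed option names for delimited key=value text.
raw_kv = semi_structured("raw", "delimited", {"delimiter": "|", "separator": "="})

print(json.dumps(raw_json.metadata, sort_keys=True))
print(json.dumps(raw_kv.metadata, sort_keys=True))
```

Keeping the hint in metadata leaves the column's declared type as a plain string, so readers that ignore the hint would still see the raw text.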
> Then imagine having the following data:
> {code:java}
> +------------+-------+-------------------+
> | ts         | event | raw               |
> +------------+-------+-------------------+
> | 2019-10-12 | click | {"field":"value"} |
> +------------+-------+-------------------+
> {code}
> Here, the query
> SELECT raw.field FROM data
> will return "value".
> Alternatively, given the following data:
> {code:java}
> +------------+-------+---------------------+
> | ts         | event | raw                 |
> +------------+-------+---------------------+
> | 2019-10-12 | click | field1=v1|field2=v2 |
> +------------+-------+---------------------+
> {code}
> the query
> SELECT raw.field1 FROM data
> will return "v1".
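The intended behavior of both queries can be imitated outside Spark with a few lines of plain Python. This is a rough sketch under stated assumptions, not the proposed implementation: `parse_cell` and `select_field` are hypothetical helpers, and the format name "delimited" and the option names "delimiter" and "separator" are invented here for the key=value case.

```python
import json


def parse_cell(value, metadata):
    """Parse one raw string according to the column's 'format' metadata."""
    fmt = metadata.get("format")
    options = metadata.get("options", {})
    if fmt == "json":
        return json.loads(value)
    if fmt == "delimited":  # assumed name for delimited key=value text
        delimiter = options.get("delimiter", "|")
        separator = options.get("separator", "=")
        parsed = {}
        for item in value.split(delimiter):
            key, sep, val = item.partition(separator)
            if sep:  # skip fragments that contain no separator
                parsed[key] = val
        return parsed
    raise ValueError(f"unsupported format: {fmt!r}")


def select_field(rows, column, field_name, metadata):
    """Imitate SELECT column.field_name FROM rows; a missing field yields None."""
    return [parse_cell(row[column], metadata).get(field_name) for row in rows]


json_rows = [{"ts": "2019-10-12", "event": "click", "raw": '{"field":"value"}'}]
kv_rows = [{"ts": "2019-10-12", "event": "click", "raw": "field1=v1|field2=v2"}]

json_meta = {"format": "json", "options": {}}
kv_meta = {"format": "delimited", "options": {"delimiter": "|", "separator": "="}}

print(select_field(json_rows, "raw", "field", json_meta))   # ['value']
print(select_field(kv_rows, "raw", "field1", kv_meta))      # ['v1']
```

The same lookup code serves both tables; only the metadata differs, which is the point of recording the format on the column rather than in each query.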
>
> As a first step, we will introduce the function "as_json", which accomplishes
> this for JSON columns.