[
https://issues.apache.org/jira/browse/SPARK-29031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-29031:
----------------------------------
Affects Version/s: (was: 3.0.0)
3.1.0
> Materialized column to accelerate queries
> -----------------------------------------
>
> Key: SPARK-29031
> URL: https://issues.apache.org/jira/browse/SPARK-29031
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Jason Guo
> Priority: Major
> Labels: SPIP
>
> Goals
> * Add a new SQL grammar of Materialized column
> * Implicitly rewrite SQL queries on the complex type of columns if there is
> a materialized columns for it
> * If the data type of the materialized columns is atomic type, even though
> the origin column type is in complex type, enable vectorized read and filter
> pushdown to improve performance
> Example
> Create a normal table
> {quote}CREATE TABLE x (
> name STRING,
> age INT,
> params STRING,
> event MAP<STRING, STRING>
> ) USING parquet;
> {quote}
>
> Add materialized columns to an existing table
> {quote}ALTER TABLE x ADD COLUMNS (
> new_age INT MATERIALIZED age + 1,
> city STRING MATERIALIZED get_json_object(params, '$.city'),
> label STRING MATERIALIZED event['label']
> );
> {quote}
>
> When issue a query as below
> {quote}SELECT name, age+1, get_json_object(params, '$.city'), event['label']
> FROM x
> WHER event['label'] = 'newuser';
> {quote}
> It's equivalent to
> {quote}SELECT name, new_age, city, label
> FROM x
> WHERE label = 'newuser';
> {quote}
>
> The query performance improved dramatically because
> # The new query (after rewritten) will read the new column city (in string
> type) instead of read the whole map of params(in map string). Much lesser
> data are need to read
> # Vectorized read can be utilized in the new query and can not be used in
> the old one. Because vectorized read can only be enabled when all required
> columns are in atomic type
> # Filter can be pushdown. Only filters on atomic column can be pushdown. The
> original filter event['label'] = 'newuser' is on complex column, so it can
> not be pushdown.
> # The new query do not need to parse JSON any more. JSON parse is a CPU
> intensive operation which will impact performance dramatically
>
>
>
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]