[ 
https://issues.apache.org/jira/browse/SPARK-29031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29031:
----------------------------------
    Affects Version/s:     (was: 3.0.0)
                       3.1.0

> Materialized column to accelerate queries
> -----------------------------------------
>
>                 Key: SPARK-29031
>                 URL: https://issues.apache.org/jira/browse/SPARK-29031
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Jason Guo
>            Priority: Major
>              Labels: SPIP
>
> Goals
>  * Add a new SQL grammar of Materialized column
>  * Implicitly rewrite SQL queries on the complex type of columns if there is 
> a materialized columns for it
>  * If the data type of the materialized columns is atomic type, even though 
> the origin column type is in complex type, enable vectorized read and filter 
> pushdown to improve performance
> Example
> Create a normal table
> {quote}CREATE TABLE x (
>     name STRING,
>     age INT,
>     params STRING,
>     event MAP<STRING, STRING>
> ) USING parquet;
> {quote}
>  
> Add materialized columns to an existing table
> {quote}ALTER TABLE x ADD COLUMNS (
>     new_age INT MATERIALIZED age + 1,
>     city STRING MATERIALIZED get_json_object(params, '$.city'),
>     label STRING MATERIALIZED event['label']
> );
> {quote}
>  
> When issue a query as below
> {quote}SELECT name, age+1, get_json_object(params, '$.city'), event['label']
> FROM x
> WHER event['label'] = 'newuser';
> {quote}
> It's equivalent to
> {quote}SELECT name, new_age, city, label
> FROM x
> WHERE label = 'newuser';
> {quote}
>  
> The query performance improved dramatically because
>  # The new query (after rewritten) will read the new column city (in string 
> type) instead of read the whole map of params(in map string). Much lesser 
> data are need to read
>  # Vectorized read can be utilized in the new query and can not be used in 
> the old one. Because vectorized read can only be enabled when all required 
> columns are in atomic type
>  # Filter can be pushdown. Only filters on atomic column can be pushdown. The 
> original filter  event['label'] = 'newuser' is on complex column, so it can 
> not be pushdown.
>  # The new query do not need to parse JSON any more. JSON parse is a CPU 
> intensive operation which will impact performance dramatically
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to