[ 
https://issues.apache.org/jira/browse/SPARK-50229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-50229.
---------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 48762
[https://github.com/apache/spark/pull/48762]

> Reduce memory usage on driver for wide schemas by reducing the lifetime of 
> AttributeReference objects created during logical planning
> -------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-50229
>                 URL: https://issues.apache.org/jira/browse/SPARK-50229
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer
>    Affects Versions: 4.0.0
>            Reporter: Utkarsh Agarwal
>            Assignee: Utkarsh Agarwal
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>
> The allAttributes method in QueryPlan 
> ([code|https://github.com/apache/spark/blob/57f6824e78e2e615778827ddebce9d7fcaae1698/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L2101])
>  unions the output of all of its children. Although this is okay in an 
> optimized plan, in a pre-optimized analyzed plan, these attributes add up 
> multiplicatively with the size of the plan. Because the `output` is usually 
> defined as a `def`, each node’s allAttributes also ends up with a distinct 
> copy of each attribute, potentially causing significant memory pressure on 
> the driver (especially under concurrency and with wide tables).
> Here is a simple example with TPC-DS Q42. SQL:
>  
> {code:sql}
> select dt.d_year, item.i_category_id, item.i_category, sum(ss_ext_sales_price)
>  from         date_dim dt, store_sales, item
>  where dt.d_date_sk = store_sales.ss_sold_date_sk
>       and store_sales.ss_item_sk = item.i_item_sk
>       and item.i_manager_id = 1
>       and dt.d_moy=11
>       and dt.d_year=2000
>  group by     dt.d_year
>               ,item.i_category_id
>               ,item.i_category
>  order by       sum(ss_ext_sales_price) desc,dt.d_year
>               ,item.i_category_id
>               ,item.i_category
>  limit 100
> {code}
> If we print out the size of each operator’s output and the size of its 
> `allAttributes`:
> {code:java}
> GlobalLimit: allAttrs: 4, output: 4
> LocalLimit: allAttrs: 4, output: 4
> Sort: allAttrs: 4, output: 4
> Aggregate: allAttrs: 73, output: 4
> Filter: allAttrs: 73, output: 73
> Join: allAttrs: 73, output: 73
> Join: allAttrs: 51, output: 51
> SubqueryAlias: allAttrs: 28, output: 28
> SubqueryAlias: allAttrs: 28, output: 28
> LogicalRelation: allAttrs: 0, output: 28
> SubqueryAlias: allAttrs: 23, output: 23
> LogicalRelation: allAttrs: 0, output: 23
> SubqueryAlias: allAttrs: 22, output: 22
> LogicalRelation: allAttrs: 0, output: 22{code}
>  
> Note how the joins and aggregate have 73 attributes each, by adding the width 
> of each relation. For queries with wide schemas, this issue is much worse. 
> Optimized plans after column pruning look far better:
> {code:java}
> Aggregate: allAttrs: 0, output: 1
> Project: allAttrs: 2, output: 0
> Join: allAttrs: 2, output: 2
> Project: allAttrs: 3, output: 1
> Join: allAttrs: 3, output: 3
> Project: allAttrs: 4, output: 2
> Join: allAttrs: 4, output: 4
> Project: allAttrs: 5, output: 3
> Join: allAttrs: 5, output: 5
> Project: allAttrs: 23, output: 4
> Filter: allAttrs: 23, output: 23
> LogicalRelation: allAttrs: 0, output: 23
> Project: allAttrs: 29, output: 1
> Filter: allAttrs: 29, output: 29
> LogicalRelation: allAttrs: 0, output: 29
> Project: allAttrs: 13, output: 1
> Filter: allAttrs: 13, output: 13
> LogicalRelation: allAttrs: 0, output: 13
> Project: allAttrs: 22, output: 1
> Filter: allAttrs: 22, output: 22
> LogicalRelation: allAttrs: 0, output: 22
> Project: allAttrs: 18, output: 1
> Filter: allAttrs: 18, output: 18
> LogicalRelation: allAttrs: 0, output: 18 {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to