[
https://issues.apache.org/jira/browse/SPARK-50229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wenchen Fan resolved SPARK-50229.
---------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed
Issue resolved by pull request 48762
[https://github.com/apache/spark/pull/48762]
> Reduce memory usage on driver for wide schemas by reducing the lifetime of
> AttributeReference objects created during logical planning
> -------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-50229
> URL: https://issues.apache.org/jira/browse/SPARK-50229
> Project: Spark
> Issue Type: Improvement
> Components: Optimizer
> Affects Versions: 4.0.0
> Reporter: Utkarsh Agarwal
> Assignee: Utkarsh Agarwal
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
>
> The allAttributes method in QueryPlan
> ([code|https://github.com/apache/spark/blob/57f6824e78e2e615778827ddebce9d7fcaae1698/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L2101])
> unions the output of all of its children. Although this is acceptable in an
> optimized plan, in the analyzed plan before optimization these attributes add
> up across the whole tree, so the number of attribute objects grows roughly
> with plan depth times schema width. Because `output` is usually defined as a
> `def`, each node's `allAttributes` also ends up with a distinct copy of each
> attribute, which can cause significant memory pressure on the driver
> (especially under concurrency and with wide tables).
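> This pattern can be illustrated with a toy model (a minimal, self-contained
> sketch; the names are illustrative, not Catalyst source):
> {code:scala}
> // Toy model of the pattern described above (illustrative only, not Spark code).
> case class Attr(name: String)
>
> trait Plan {
>   def children: Seq[Plan]
>   // `output` is a def, so it is re-evaluated on every call.
>   def output: Seq[Attr]
>   // Each node materializes its own concatenation of its children's outputs.
>   lazy val allAttributes: Seq[Attr] = children.flatMap(_.output)
> }
>
> case class Relation(cols: Seq[String]) extends Plan {
>   def children: Seq[Plan] = Nil
>   def output: Seq[Attr] = cols.map(Attr(_))             // fresh Attr objects per call
> }
>
> case class Join(left: Plan, right: Plan) extends Plan {
>   def children: Seq[Plan] = Seq(left, right)
>   def output: Seq[Attr] = left.output ++ right.output   // widths add up across children
> }
> {code}
> In this toy model every node above the relations holds the combined width of
> its inputs in `allAttributes`, and because `output` is recomputed on every
> call those attribute objects are not shared between nodes.
>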
> Here is a simple example with TPC-DS Q42:
>
> {code:sql}
> select dt.d_year, item.i_category_id, item.i_category, sum(ss_ext_sales_price)
> from date_dim dt, store_sales, item
> where dt.d_date_sk = store_sales.ss_sold_date_sk
>   and store_sales.ss_item_sk = item.i_item_sk
>   and item.i_manager_id = 1
>   and dt.d_moy = 11
>   and dt.d_year = 2000
> group by dt.d_year, item.i_category_id, item.i_category
> order by sum(ss_ext_sales_price) desc, dt.d_year, item.i_category_id, item.i_category
> limit 100
> {code}
> If we print out the size of each operator’s output and the size of its
> `allAttributes`:
> {code:java}
> GlobalLimit: allAttrs: 4, output: 4
> LocalLimit: allAttrs: 4, output: 4
> Sort: allAttrs: 4, output: 4
> Aggregate: allAttrs: 73, output: 4
> Filter: allAttrs: 73, output: 73
> Join: allAttrs: 73, output: 73
> Join: allAttrs: 51, output: 51
> SubqueryAlias: allAttrs: 28, output: 28
> SubqueryAlias: allAttrs: 28, output: 28
> LogicalRelation: allAttrs: 0, output: 28
> SubqueryAlias: allAttrs: 23, output: 23
> LogicalRelation: allAttrs: 0, output: 23
> SubqueryAlias: allAttrs: 22, output: 22
> LogicalRelation: allAttrs: 0, output: 22
> {code}
>
> Note how the top join, the filter, and the aggregate each carry 73
> attributes, the combined width of all three relations (28 + 23 + 22). For
> queries over wide schemas, the blow-up is much worse.
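> The per-operator counts above can be collected with a traversal along these
> lines (a minimal sketch, not part of this issue; `q42` stands for the SQL
> text shown above):
> {code:scala}
> // Print, for every operator in the analyzed plan, the size of its output
> // and of its allAttributes (run in a spark-shell style session).
> val analyzed = spark.sql(q42).queryExecution.analyzed  // q42: the SQL above
> analyzed.foreach { p =>
>   println(s"${p.nodeName}: allAttrs: ${p.allAttributes.attrs.size}, " +
>     s"output: ${p.output.size}")
> }
> {code}
>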
> Optimized plans after column pruning look far better:
> {code:java}
> Aggregate: allAttrs: 0, output: 1
> Project: allAttrs: 2, output: 0
> Join: allAttrs: 2, output: 2
> Project: allAttrs: 3, output: 1
> Join: allAttrs: 3, output: 3
> Project: allAttrs: 4, output: 2
> Join: allAttrs: 4, output: 4
> Project: allAttrs: 5, output: 3
> Join: allAttrs: 5, output: 5
> Project: allAttrs: 23, output: 4
> Filter: allAttrs: 23, output: 23
> LogicalRelation: allAttrs: 0, output: 23
> Project: allAttrs: 29, output: 1
> Filter: allAttrs: 29, output: 29
> LogicalRelation: allAttrs: 0, output: 29
> Project: allAttrs: 13, output: 1
> Filter: allAttrs: 13, output: 13
> LogicalRelation: allAttrs: 0, output: 13
> Project: allAttrs: 22, output: 1
> Filter: allAttrs: 22, output: 22
> LogicalRelation: allAttrs: 0, output: 22
> Project: allAttrs: 18, output: 1
> Filter: allAttrs: 18, output: 18
> LogicalRelation: allAttrs: 0, output: 18
> {code}
>