[
https://issues.apache.org/jira/browse/SPARK-45959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-45959:
----------------------------------
Target Version/s: (was: 4.1.0)
> SPIP: Abusing DataSet.withColumn can cause huge tree with severe perf
> degradation
> ---------------------------------------------------------------------------------
>
> Key: SPARK-45959
> URL: https://issues.apache.org/jira/browse/SPARK-45959
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.1, 4.1.0, 4.0.0
> Reporter: Asif
> Priority: Major
> Labels: pull-request-available
>
> Though the documentation clearly recommends adding all columns in a single
> shot, in practice it is difficult to expect customers to modify their code:
> in Spark 2 the analyzer rules did not perform deep tree traversals, while in
> Spark 3 the plans are also cloned before being handed to the analyzer,
> optimizer, etc., which was not the case in Spark 2. Together these changes
> have increased some query times from 5 minutes to 2 to 3 hours.
> Often the columns are added to the plan via some for-loop logic that just
> keeps appending new computations based on some rule.
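The tree growth described above can be sketched with a toy model (hypothetical classes, not Spark's actual internals): each withColumn-style call wraps the current plan in one more Project node, so a loop of N calls yields N nested Projects, whereas a single projection adds only one.

```python
# Toy model (not Spark's actual classes) of why a withColumn loop
# builds a deep chain of Project nodes over the logical plan.

class Plan:
    """Minimal stand-in for a logical plan node."""
    def __init__(self, name, child=None):
        self.name = name
        self.child = child

    def depth(self):
        return 1 + (self.child.depth() if self.child else 0)

def with_column(plan, col):
    # Each call wraps the existing plan in a new Project node,
    # mirroring how DataFrame.withColumn adds one Project per call.
    return Plan(f"Project(+{col})", child=plan)

base = Plan("Relation")

# Anti-pattern: adding 100 columns one at a time -> 100 nested Projects.
looped = base
for i in range(100):
    looped = with_column(looped, f"c{i}")

# Recommended pattern: add all columns in a single projection.
single = Plan("Project(+c0..c99)", child=base)

print(looped.depth())  # 101
print(single.depth())  # 2
```

Analyzer and optimizer rules that traverse the whole tree then pay a cost proportional to this depth on every pass, which is where the slowdown comes from.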
> So my suggestion is to collapse the Projects early, once analysis of the
> logical plan is done but before the plan is assigned to the field variable
> in QueryExecution.
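A minimal sketch of the collapse step, again with toy classes (this is analogous in spirit to Catalyst's CollapseProject rule, not its actual implementation, which must also check expression determinism and reference counts): consecutive Project nodes are merged bottom-up into a single Project, flattening the chain before later phases traverse it.

```python
# Toy sketch of collapsing a chain of adjacent Project nodes right
# after analysis; column merging here is a simplification.

class Node:
    def __init__(self, kind, cols=(), child=None):
        self.kind, self.cols, self.child = kind, tuple(cols), child

def collapse_projects(plan):
    """Merge consecutive Project nodes bottom-up into one Project."""
    if plan is None:
        return None
    child = collapse_projects(plan.child)
    if plan.kind == "Project" and child is not None and child.kind == "Project":
        # Fold the child's column list into this Project, drop the child.
        return Node("Project", child.cols + plan.cols, child.child)
    return Node(plan.kind, plan.cols, child)

def depth(plan):
    return 0 if plan is None else 1 + depth(plan.child)

# Build Relation <- Project(c0) <- ... <- Project(c49), as a loop would.
plan = Node("Relation")
for i in range(50):
    plan = Node("Project", [f"c{i}"], plan)

collapsed = collapse_projects(plan)
print(depth(plan))          # 51
print(depth(collapsed))     # 2
print(len(collapsed.cols))  # 50
```

Running this once after analysis keeps the plan shallow, so later rule passes and the cloning done per phase no longer walk a tree whose depth grows with the number of withColumn calls.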
> The PR for the above is ready for review.
> The major change is in the way the lookup is performed in CacheManager.
> I have described the logic in the PR and have added multiple tests.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]