[
https://issues.apache.org/jira/browse/SPARK-47818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-47818:
-----------------------------------
Labels: pull-request-available (was: )
> Introduce plan cache in SparkConnectPlanner to improve performance of Analyze
> requests
> --------------------------------------------------------------------------------------
>
> Key: SPARK-47818
> URL: https://issues.apache.org/jira/browse/SPARK-47818
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 4.0.0
> Reporter: Xi Lyu
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
>
> While building a DataFrame step by step, each operation produces a new
> DataFrame whose schema is empty and lazily computed on access. If a user's
> code frequently accesses the schema of these new DataFrames using methods
> such as `df.columns`, it results in a large number of Analyze requests to
> the server. Each time, the entire plan must be reanalyzed, leading to poor
> performance, especially when constructing highly complex plans.
> Now, by introducing a plan cache in SparkConnectPlanner, we aim to reduce
> the overhead of repeated analysis during this process. Significant
> computation is saved whenever the resolved logical plan of a subtree can be
> reused from the cache instead of being reanalyzed.
> A minimal example of the problem:
> {code:python}
> import pyspark.sql.functions as F
>
> df = spark.range(10)
> for i in range(200):
>     if str(i) not in df.columns:  # <-- the df.columns call causes a new Analyze request in every iteration
>         df = df.withColumn(str(i), F.col("id") + i)
> df.show()
> {code}
> With this patch, the performance of the above code improved from ~115s to ~5s.
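> The caching idea can be sketched outside of Spark as a small memoization
> layer over an expensive analysis step. Note that `PlanCache`,
> `get_or_analyze`, and the string plan keys below are hypothetical
> illustrations of the technique, not the actual SparkConnectPlanner API:

```python
from collections import OrderedDict

class PlanCache:
    """Toy sketch of a bounded LRU cache mapping a plan key to its
    resolved (analyzed) form. Hypothetical; not Spark's implementation."""

    def __init__(self, max_size=16):
        self.max_size = max_size
        self._cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_analyze(self, plan_key, analyze):
        """Return the cached result for plan_key, or run analyze() once
        and cache its result."""
        if plan_key in self._cache:
            # Cache hit: refresh LRU order and skip the expensive analysis.
            self._cache.move_to_end(plan_key)
            self.hits += 1
            return self._cache[plan_key]
        # Cache miss: run the expensive analysis and remember the result.
        self.misses += 1
        resolved = analyze(plan_key)
        self._cache[plan_key] = resolved
        if len(self._cache) > self.max_size:
            # Evict the least recently used entry to bound memory.
            self._cache.popitem(last=False)
        return resolved
```

> With a cache like this keyed by the unresolved subtree, repeated
> `df.columns` calls in the loop above would hit the cache for
> already-analyzed subtrees rather than re-running analysis over the whole
> plan each iteration.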
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]