[
https://issues.apache.org/jira/browse/SPARK-47818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xi Lyu updated SPARK-47818:
---------------------------
Description:
While a DataFrame is built up step by step, each transformation produces a new
DataFrame whose schema is initially empty and only computed lazily on access.
If user code frequently accesses the schema of these intermediate DataFrames
through methods such as `df.columns`, it results in a large number of Analyze
requests to the server, and each request re-analyzes the entire plan from
scratch, leading to poor performance, especially when constructing highly
complex plans.

By introducing a plan cache in SparkConnectPlanner, we aim to reduce the
overhead of this repeated analysis: significant computation is saved whenever
the resolved logical plan of a subtree of the plan is already cached.
A minimal example of the problem:
{code:python}
import pyspark.sql.functions as F

df = spark.range(10)
for i in range(200):
    # The df.columns call causes a new Analyze request in every iteration
    if str(i) not in df.columns:
        df = df.withColumn(str(i), F.col("id") + i)
df.show()
{code}
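To illustrate the idea (a rough sketch only, not the actual SparkConnectPlanner code): the server can keep a small LRU map from an unresolved proto relation to its resolved logical plan, so that a structurally identical subtree arriving in a later request is a cache hit and skips re-analysis. The `ProtoRelation`/`ResolvedPlan` types, the cache size, and the `transformRelation`/`analyze` helpers below are all hypothetical stand-ins:
{code:scala}
import java.util.{Collections, LinkedHashMap, Map => JMap}

// Hypothetical stand-ins for the unresolved Connect proto relation and the
// resolved Catalyst plan; the real types live inside Spark and are not used here.
final case class ProtoRelation(id: Long, children: Seq[ProtoRelation])
final case class ResolvedPlan(description: String)

object PlanCacheSketch {
  private val maxEntries = 16

  // Small LRU map from an unresolved proto subtree to its resolved plan.
  // Case-class (value-based) equality means a structurally identical subtree
  // seen in a later request is a cache hit.
  private val planCache: JMap[ProtoRelation, ResolvedPlan] =
    Collections.synchronizedMap(
      new LinkedHashMap[ProtoRelation, ResolvedPlan](maxEntries, 0.75f, true) {
        override protected def removeEldestEntry(
            eldest: JMap.Entry[ProtoRelation, ResolvedPlan]): Boolean =
          size() > maxEntries
      })

  // Stand-in for the expensive analysis step; child subtrees are resolved
  // through the cache, so previously analyzed prefixes are not redone.
  private def analyze(rel: ProtoRelation): ResolvedPlan = {
    rel.children.foreach(transformRelation)
    println(s"analyzing relation ${rel.id}")
    ResolvedPlan(s"resolved-${rel.id}")
  }

  // Consult the cache before re-analyzing the subtree.
  def transformRelation(rel: ProtoRelation): ResolvedPlan = {
    val cached = planCache.get(rel)
    if (cached != null) cached
    else {
      val resolved = analyze(rel)
      planCache.put(rel, resolved)
      resolved
    }
  }

  def main(args: Array[String]): Unit = {
    val base = ProtoRelation(0, Nil)
    val step1 = ProtoRelation(1, Seq(base))
    transformRelation(step1) // analyzes relations 0 and 1
    val step2 = ProtoRelation(2, Seq(step1))
    transformRelation(step2) // only relation 2 is analyzed; 0 and 1 are cache hits
  }
}
{code}
In practice such a cache would need to be bounded and likely scoped per session; the sketch only shows the structural-equality lookup that makes subtree reuse possible.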
was:
While a DataFrame is built up step by step, each transformation produces a new
DataFrame whose schema is initially empty and only computed lazily on access.
If user code frequently accesses the schema of these intermediate DataFrames
through methods such as `df.columns`, it results in a large number of Analyze
requests to the server, and each request re-analyzes the entire plan from
scratch, leading to poor performance, especially when constructing highly
complex plans.

By introducing a plan cache in SparkConnectPlanner, we aim to reduce the
overhead of this repeated analysis: significant computation is saved whenever
the resolved logical plan of a subtree of the plan is already cached.
> Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests
> --------------------------------------------------------------------------------------
>
> Key: SPARK-47818
> URL: https://issues.apache.org/jira/browse/SPARK-47818
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 4.0.0
> Reporter: Xi Lyu
> Priority: Major
> Fix For: 4.0.0
>
>
> While a DataFrame is built up step by step, each transformation produces a new
> DataFrame whose schema is initially empty and only computed lazily on access.
> If user code frequently accesses the schema of these intermediate DataFrames
> through methods such as `df.columns`, it results in a large number of Analyze
> requests to the server, and each request re-analyzes the entire plan from
> scratch, leading to poor performance, especially when constructing highly
> complex plans.
> By introducing a plan cache in SparkConnectPlanner, we aim to reduce the
> overhead of this repeated analysis: significant computation is saved whenever
> the resolved logical plan of a subtree of the plan is already cached.
> A minimal example of the problem:
> {code:python}
> import pyspark.sql.functions as F
>
> df = spark.range(10)
> for i in range(200):
>     # The df.columns call causes a new Analyze request in every iteration
>     if str(i) not in df.columns:
>         df = df.withColumn(str(i), F.col("id") + i)
> df.show()
> {code}
>
--
This message was sent by Atlassian Jira (v8.20.10#820010)