Xi Lyu created SPARK-47818:
------------------------------
Summary: Introduce plan cache in SparkConnectPlanner to improve
performance of Analyze requests
Key: SPARK-47818
URL: https://issues.apache.org/jira/browse/SPARK-47818
Project: Spark
Issue Type: Improvement
Components: Connect
Affects Versions: 4.0.0
Reporter: Xi Lyu
Fix For: 4.0.0
When a DataFrame is built step by step, each step produces a new DataFrame
with an empty schema. If user code frequently accesses the schema of these
new DataFrames, for example through `df.columns`, every access sends an
Analyze request to the server. Each request reanalyzes the entire plan, which
leads to poor performance, especially when constructing highly complex plans.
By introducing a plan cache in SparkConnectPlanner, we aim to reduce the
overhead of repeated analysis during this process. Significant computation
can be saved if the resolved logical plan of a subtree of the plan can be
cached.
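
To illustrate the idea, here is a minimal, self-contained sketch in plain
Python. It is not the SparkConnectPlanner implementation and does not use any
Spark API: `Plan`, `PlanCache`, `Analyzer` and `build_and_inspect` are
hypothetical names, and the LRU eviction policy and the cache size of 16 are
assumptions made for the example. It only counts how many plan nodes get
analyzed when the schema is inspected after every step, with and without
reusing the resolved result of a previously analyzed subtree.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Plan:
    """Hypothetical stand-in for an unresolved plan node (not a Spark class)."""
    op: str
    child: Optional["Plan"] = None


class PlanCache:
    """Small LRU cache from an unresolved plan to its resolved result."""

    def __init__(self, max_size: int = 16) -> None:  # size is an assumption
        self.max_size = max_size
        self._entries = OrderedDict()

    def get(self, plan: Plan) -> Optional[Tuple[str, ...]]:
        if plan not in self._entries:
            return None
        self._entries.move_to_end(plan)  # mark as most recently used
        return self._entries[plan]

    def put(self, plan: Plan, resolved: Tuple[str, ...]) -> None:
        self._entries[plan] = resolved
        self._entries.move_to_end(plan)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used


class Analyzer:
    """Toy analyzer: the 'schema' is the tuple of ops along the plan chain."""

    def __init__(self, cache: Optional[PlanCache] = None) -> None:
        self.cache = cache
        self.nodes_analyzed = 0

    def analyze(self, plan: Plan) -> Tuple[str, ...]:
        if self.cache is not None:
            hit = self.cache.get(plan)
            if hit is not None:
                return hit  # reuse the resolved subtree
        child = self.analyze(plan.child) if plan.child is not None else ()
        self.nodes_analyzed += 1  # work done for this node only
        resolved = child + (plan.op,)
        if self.cache is not None:
            self.cache.put(plan, resolved)
        return resolved


def build_and_inspect(analyzer: Analyzer, steps: int) -> int:
    """Extend the plan one node at a time, inspecting the schema each step."""
    plan = None
    for i in range(steps):
        plan = Plan(f"col_{i}", plan)
        analyzer.analyze(plan)  # stands in for a `df.columns` access
    return analyzer.nodes_analyzed


print(build_and_inspect(Analyzer(), 50))             # 1275 nodes analyzed
print(build_and_inspect(Analyzer(PlanCache()), 50))  # 50 nodes analyzed
```

In this toy model the uncached run reanalyzes the whole chain on every access
(1 + 2 + ... + 50 = 1275 node visits), while the cached run only analyzes the
newly added node each time, because the resolved result of its child subtree
is found in the cache.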