[jira] [Created] (SPARK-47818) Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests

Xi Lyu (Jira) Thu, 11 Apr 2024 08:36:49 -0700

Xi Lyu created SPARK-47818:
------------------------------

             Summary: Introduce plan cache in SparkConnectPlanner to improve 
performance of Analyze requests
                 Key: SPARK-47818
                 URL: https://issues.apache.org/jira/browse/SPARK-47818
             Project: Spark
          Issue Type: Improvement
          Components: Connect
    Affects Versions: 4.0.0
            Reporter: Xi Lyu
             Fix For: 4.0.0



When a DataFrame is built step by step, each step generates a new DataFrame 
with an empty schema. If a user's code frequently accesses the schema of 
these new DataFrames using methods such as `df.columns`, it results in a 
large number of Analyze requests to the server. Each time, the entire plan 
needs to be reanalyzed, leading to poor performance, especially when 
constructing highly complex plans.
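
The access pattern described above can be illustrated with a small toy model. 
This is only a sketch: `FakeServer` and `FakeDataFrame` are hypothetical 
stand-ins, not Spark Connect classes, and the model assumes each step adds 
exactly one plan node and that analysis cost is proportional to plan size.

```python
class FakeServer:
    """Hypothetical stand-in for the server; it only counts the work done."""

    def __init__(self):
        self.analyze_requests = 0
        self.nodes_analyzed = 0

    def analyze(self, plan):
        # No cache: every request re-analyzes the whole plan from scratch.
        self.analyze_requests += 1
        self.nodes_analyzed += len(plan)
        return list(plan)  # the "schema" is just the column names here


class FakeDataFrame:
    """Hypothetical client-side DataFrame that holds an unresolved plan."""

    def __init__(self, server, plan=()):
        self._server = server
        self._plan = plan  # one entry per construction step

    def with_column(self, name):
        # Creating the new DataFrame is client-side only; no schema is known.
        return FakeDataFrame(self._server, self._plan + (name,))

    @property
    def columns(self):
        # Schema access sends one Analyze request carrying the full plan.
        return self._server.analyze(self._plan)


server = FakeServer()
df = FakeDataFrame(server)
for i in range(50):
    df = df.with_column(f"c{i}")
    _ = df.columns  # schema access after every step

print(server.analyze_requests, server.nodes_analyzed)
```

Under these assumptions, 50 steps produce 50 Analyze requests but 
1 + 2 + ... + 50 = 1275 node analyses, so the total work grows quadratically 
with the number of steps.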

By introducing a plan cache in SparkConnectPlanner, we aim to reduce the 
overhead of repeated analysis during this process. Significant computation 
can be saved if the resolved logical plan of a subtree can be cached.
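
A minimal sketch of the subtree-caching idea follows. It is not the 
SparkConnectPlanner implementation: `Plan` and `ToyPlanner` are hypothetical 
names, "resolving" a node is reduced to appending its name, and a real cache 
would also need size limits and invalidation, which are omitted here.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Plan:
    """Hypothetical unresolved plan node: one operation over an optional child."""
    op: str
    child: Optional["Plan"] = None


class ToyPlanner:
    """Hypothetical planner that can memoize resolved subtrees."""

    def __init__(self, use_cache: bool):
        self.use_cache = use_cache
        self.cache: Dict[Plan, Tuple[str, ...]] = {}
        self.nodes_resolved = 0

    def analyze(self, plan: Optional[Plan]) -> Tuple[str, ...]:
        if plan is None:
            return ()
        if self.use_cache and plan in self.cache:
            return self.cache[plan]  # reuse the resolved subtree
        resolved = self.analyze(plan.child) + (plan.op,)
        self.nodes_resolved += 1  # count the real analysis work
        if self.use_cache:
            self.cache[plan] = resolved
        return resolved


def build_and_inspect(planner: ToyPlanner, steps: int) -> int:
    """Build a plan one step at a time, analyzing it after every step."""
    plan = None
    for i in range(steps):
        plan = Plan(f"c{i}", plan)
        planner.analyze(plan)
    return planner.nodes_resolved


uncached = build_and_inspect(ToyPlanner(use_cache=False), 50)
cached = build_and_inspect(ToyPlanner(use_cache=True), 50)
print(uncached, cached)
```

In this toy model the uncached planner resolves 1275 nodes over 50 steps, 
while the cached planner resolves 50, because each new request only has to 
resolve the single node added on top of an already cached subtree.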



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
