[PR] [WIP][SPARK-47818] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests [spark]

via GitHub Thu, 11 Apr 2024 10:51:50 -0700


xi-db opened a new pull request, #46012:
URL: https://github.com/apache/spark/pull/46012


   ### What changes were proposed in this pull request?
   
   When a DataFrame is built step by step, each step produces a new DataFrame 
with an empty schema, which is computed lazily on first access. If user code 
frequently accesses the schema of these new DataFrames, for example through 
`df.columns`, every access sends an Analyze request to the server. Each request 
reanalyzes the entire plan, which leads to poor performance, especially when 
constructing highly complex plans.
   
   This PR introduces a plan cache in SparkConnectPlanner to reduce the 
overhead of repeated analysis during this process. Significant computation is 
saved whenever the resolved logical plan of a subtree can be served from the 
cache instead of being analyzed again.
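
The effect of caching resolved subtrees can be illustrated with a small, self-contained sketch. This is not the implementation in this PR: `PlanCache`, `analyze`, and `count_resolved_nodes` are hypothetical names, plans are modelled as nested tuples, and "analysis" only computes column names. It shows why analyzing after every step costs a quadratic number of node resolutions without a cache and a linear number with one.

```python
from collections import OrderedDict


class PlanCache:
    """Toy bounded LRU cache from an unresolved plan to its resolved result."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._entries = OrderedDict()

    def get_or_analyze(self, plan, analyze_fn):
        if self.max_size <= 0:
            # A non-positive size disables caching entirely.
            return analyze_fn(plan)
        if plan in self._entries:
            # Cache hit: mark the entry as most recently used and reuse it.
            self._entries.move_to_end(plan)
            return self._entries[plan]
        resolved = analyze_fn(plan)
        self._entries[plan] = resolved
        if len(self._entries) > self.max_size:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return resolved


def analyze(plan, cache, counter):
    """Resolve ("range", n) or ("with_column", child, name) to a tuple of columns."""

    def resolve(node):
        counter["nodes"] += 1  # counts how many plan nodes are actually resolved
        if node[0] == "range":
            return ("id",)
        _, child, name = node
        # The child subtree is looked up in the cache before being re-resolved.
        return cache.get_or_analyze(child, resolve) + (name,)

    return cache.get_or_analyze(plan, resolve)


def count_resolved_nodes(steps, max_size):
    """Build a plan step by step, analyzing before each step, and count the work."""
    cache = PlanCache(max_size)
    counter = {"nodes": 0}
    plan = ("range", 10)
    for i in range(steps):
        analyze(plan, cache, counter)  # stands in for a df.columns access
        plan = ("with_column", plan, str(i))
    return counter["nodes"]


if __name__ == "__main__":
    print("with cache:   ", count_resolved_nodes(50, max_size=100))
    print("without cache:", count_resolved_nodes(50, max_size=0))
```

In this toy model, each analysis with the cache enabled resolves only the newly added node, while without the cache every analysis walks the whole plan again.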
   
   A minimal example of the problem:
   
   ```
   import pyspark.sql.functions as F
   # Assumes `spark` is an active Spark Connect session.
   df = spark.range(10)
   for i in range(200):
     # The df.columns call causes a new Analyze request in every iteration.
     if str(i) not in df.columns:
       df = df.withColumn(str(i), F.col("id") + i)
   df.show()
   ```
   
   With this patch, the performance of the above code improved from ~115s to 
~5s.
   
   
   ### Why are the changes needed?
   
   The performance improvement is substantial: in the minimal example above, 
the running time drops from ~115s to ~5s.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, a static conf `spark.connect.session.planCache.maxSize` and a dynamic 
conf `spark.connect.session.planCache.enabled` are added.
   
   * `spark.connect.session.planCache.maxSize`: Sets the maximum number of 
cached resolved logical plans in a Spark Connect session. A value less than or 
equal to zero disables the plan cache.
   * `spark.connect.session.planCache.enabled`: When true, the cache of 
resolved logical plans is enabled if `spark.connect.session.planCache.maxSize` 
is greater than zero. When false, the cache is disabled even if 
`spark.connect.session.planCache.maxSize` is greater than zero. The caching is 
best-effort and not guaranteed.
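
The interaction between the two confs described above can be summarized as a small predicate. This is only a sketch of the documented semantics: `plan_cache_active` is a hypothetical helper, not a function in Spark.

```python
def plan_cache_active(max_size: int, enabled: bool) -> bool:
    """Return True when the plan cache would be in effect, per the conf descriptions.

    max_size mirrors the static conf spark.connect.session.planCache.maxSize;
    enabled mirrors the dynamic conf spark.connect.session.planCache.enabled.
    """
    return enabled and max_size > 0


if __name__ == "__main__":
    for max_size in (16, 0, -1):
        for enabled in (True, False):
            print(max_size, enabled, plan_cache_active(max_size, enabled))

# With a live session, the dynamic conf could be toggled at runtime, e.g.:
#   spark.conf.set("spark.connect.session.planCache.enabled", "false")
# The static conf has to be set when the server is started.
```

Even when the predicate is true, a given plan is not guaranteed to be served from the cache, since the caching is best-effort.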
   
   
   ### How was this patch tested?
   
   Some new tests are added in SparkConnectSessionHolderSuite.scala.
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

