wecharyu opened a new pull request, #44111:
URL: https://github.com/apache/spark/pull/44111

   ### What changes were proposed in this pull request?
   In this patch, we introduce a new switch that enables filtering partitions on 
the Spark side and then fetching the target partitions through the 
high-performance API `Hive#getPartitionsByNames`.
   1. Add a switch `spark.sql.hive.getPartitionByName.enabled` that enables 
filtering partitions in Spark and fetching them by name through HMS.
   2. Unify the `listPartitionsByFilter` calls through `ExternalCatalogUtils` to 
make sure most partition pruning paths can use the new switch.
   3. Implement the `listPartitionsByNames` API in the different catalogs.
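
   The idea behind these changes can be illustrated with a minimal, 
language-neutral sketch. It is not the code in this patch: `FakeMetastore`, 
`parse_partition_name`, `prune_partition_names`, and 
`list_partitions_by_filter` are hypothetical names, and the assumption that 
partition names are listed first and then pruned on the client is inferred from 
the description above. Escaping of special characters in partition names is 
ignored.

```python
from typing import Callable, Dict, List

PartitionSpec = Dict[str, str]


def parse_partition_name(name: str) -> PartitionSpec:
    """Parse 'dt=2023-11-01/hour=05' into {'dt': '2023-11-01', 'hour': '05'}."""
    return dict(part.split("=", 1) for part in name.split("/"))


def prune_partition_names(
    names: List[str], predicate: Callable[[PartitionSpec], bool]
) -> List[str]:
    """Evaluate the partition filter on the client side, using names only."""
    return [n for n in names if predicate(parse_partition_name(n))]


class FakeMetastore:
    """Hypothetical in-memory stand-in for HMS (name -> storage location)."""

    def __init__(self, partitions: Dict[str, str]) -> None:
        self._partitions = dict(partitions)

    def list_partition_names(self) -> List[str]:
        # Cheap call: returns names only, no partition metadata.
        return sorted(self._partitions)

    def get_partitions_by_names(self, names: List[str]) -> Dict[str, str]:
        # Point lookup by name; no filter is evaluated by the metastore.
        return {n: self._partitions[n] for n in names if n in self._partitions}


def list_partitions_by_filter(
    metastore: FakeMetastore, predicate: Callable[[PartitionSpec], bool]
) -> Dict[str, str]:
    """Prune by name on the client, then fetch only the surviving partitions."""
    names = metastore.list_partition_names()
    wanted = prune_partition_names(names, predicate)
    return metastore.get_partitions_by_names(wanted)


metastore = FakeMetastore(
    {
        "dt=2023-11-01/hour=00": "/warehouse/t/dt=2023-11-01/hour=00",
        "dt=2023-11-02/hour=00": "/warehouse/t/dt=2023-11-02/hour=00",
        "dt=2023-11-02/hour=01": "/warehouse/t/dt=2023-11-02/hour=01",
    }
)
selected = list_partitions_by_filter(metastore, lambda s: s["dt"] == "2023-11-02")
```

   In this sketch the metastore only answers a name listing and a lookup by 
name, so the cost of evaluating the filter moves from the metastore backend to 
the client.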
   
   ### Why are the changes needed?
   The `Hive#getPartitionsByFilter` API performs poorly and puts heavy load on 
the Hive MetaStore backend database when there are many calls against tables 
with many partitions. This patch has two main advantages:
   1. Improve the performance of `listPartitionsByFilter` when querying tables 
containing a large number of partitions.
   2. Reduce the load on the database behind Hive MetaStore and help keep HMS 
healthy.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, users can turn on `spark.sql.hive.getPartitionByName.enabled` if the 
Spark application needs to filter partitions on tables containing a large number 
of partitions.
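
   As a config fragment, and assuming a Spark build that includes this patch, 
the switch could be enabled in `spark-defaults.conf` like this (the property 
name is the one proposed above; it does not exist in builds without this 
change):

```
spark.sql.hive.getPartitionByName.enabled  true
```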
   
   
   ### How was this patch tested?
   Add a unit test.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

