[GitHub] [spark] cxzl25 commented on pull request #32583: [SPARK-35437][SQL] Hive partition filtering client optimization

GitBox Tue, 18 May 2021 08:48:56 -0700


cxzl25 commented on pull request #32583:
URL: https://github.com/apache/spark/pull/32583#issuecomment-843285679



   In our production environment, there is a partitioned table with about 
80,000 partitions; dt is the partition column.
   
   For a query like the one below, Spark pulls all the partitions. This 
generates a large number of partition-lookup SQL statements, puts heavy 
pressure on the MetaStore Server, and makes the query around 100 times slower 
than in Hive.
   
   ```sql
   select a.*
   from X a
   where substr(a.dt,1,10) = '2018-01-07'
   limit 10;
   ```
   Hive: Time taken: 2.816 seconds, Fetched: 10 row(s)
   Spark: Time taken: 248 seconds, Fetched: 10 row(s)
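
To illustrate the general idea of filtering on the client, here is a minimal Python sketch. It is not the code in this PR. It assumes the client can cheaply obtain only the partition names (strings such as `dt=...`), evaluate the predicate on those names locally, and then request full metadata only for the names that match. The helper names (`parse_partition_name`, `prune_partition_names`) and the sample `dt` values are hypothetical, and the escaping that Hive applies to special characters in partition names is ignored.

```python
from typing import Callable, Dict, List


def parse_partition_name(name: str) -> Dict[str, str]:
    """Split a Hive-style partition name such as 'dt=x/hr=01' into a dict.

    Assumption: values contain no escaped characters ('/' or '=').
    """
    spec: Dict[str, str] = {}
    for part in name.split("/"):
        key, _, value = part.partition("=")
        spec[key] = value
    return spec


def prune_partition_names(
    names: List[str],
    predicate: Callable[[Dict[str, str]], bool],
) -> List[str]:
    """Keep only the partition names whose parsed spec satisfies the predicate."""
    return [n for n in names if predicate(parse_partition_name(n))]


# Hypothetical partition names; the real dt format is not given in the comment.
all_names = [
    f"dt=2018-01-{day:02d}-{hour:02d}"
    for day in range(1, 11)
    for hour in range(24)
]


def dt_prefix_matches(spec: Dict[str, str]) -> bool:
    # Client-side equivalent of: substr(dt, 1, 10) = '2018-01-07'
    return spec["dt"][:10] == "2018-01-07"


wanted = prune_partition_names(all_names, dt_prefix_matches)
print(f"{len(wanted)} of {len(all_names)} partitions need full metadata")
```

With this approach, only the names in `wanted` would need a full metadata lookup, instead of every partition of the table.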
   
   
![image](https://user-images.githubusercontent.com/3898450/118682555-23e53980-b833-11eb-8f11-aa9f64754aeb.png)
   
   
![image](https://user-images.githubusercontent.com/3898450/118682597-2e073800-b833-11eb-8294-69985c8a7954.png)
   
   
   
   
   

