wecharyu commented on PR #39115:
URL: https://github.com/apache/spark/pull/39115#issuecomment-1367959487

   @dongjoon-hyun @Tagar @srielau thanks for the discussion, and sorry for 
my slow reply.
   I have not found this syntax in other DBMSs either. My motivation is to add 
an optional way to improve performance in some cases. 
   Suppose an external partitioned table `test_db.test_tbl` has two partition 
columns, `grass_region string` and `grass_date date`. A partition filter can 
improve performance in cases such as:
   1. New data is ingested into HDFS every day, and we need to repair the 
table for the new data.
   In this case we only need to add the partitions for the new date. The 
filter shortens the path listing and avoids unnecessary attempts to add the 
huge number of partitions that already exist:
    `MSCK REPAIR TABLE test_db.test_tbl ADD PARTITIONS (grass_date='2022-12-30')`
   2. We no longer need the data for one `grass_region`.
   In this case, after we delete the HDFS path 
`hdfs://path/to/tbl/grass_region=region_X`, repairing the table with a 
partition filter does not need to list all partitions:
   `MSCK REPAIR TABLE test_db.test_tbl DROP PARTITIONS (grass_region='region_X')`
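
   To illustrate why a filter helps in both cases, here is a toy model in 
Python. It is only a sketch and not the Spark implementation: the function 
`discover_partitions`, the in-memory `tree`, and the extra region and date 
values are made up for this example. It shows how a partial partition spec 
shrinks the set of candidate partitions a repair would have to consider.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for the table's directory layout on HDFS:
# {first-level dir: [second-level dirs]}. All values are illustrative.
regions = ["region_X", "region_Y", "region_Z"]
dates = ["2022-12-27", "2022-12-28", "2022-12-29", "2022-12-30"]
tree: Dict[str, List[str]] = {
    f"grass_region={r}": [f"grass_date={d}" for d in dates] for r in regions
}


def discover_partitions(
    tree: Dict[str, List[str]],
    spec_filter: Optional[Dict[str, str]] = None,
) -> List[Dict[str, str]]:
    """Return the candidate partitions, skipping paths the filter excludes."""
    spec_filter = spec_filter or {}
    found = []
    for region_dir, date_dirs in tree.items():
        region = region_dir.split("=", 1)[1]
        # A non-matching region prunes the whole subtree.
        if spec_filter.get("grass_region", region) != region:
            continue
        for date_dir in date_dirs:
            date = date_dir.split("=", 1)[1]
            # A non-matching date is skipped before it becomes a candidate.
            if spec_filter.get("grass_date", date) != date:
                continue
            found.append({"grass_region": region, "grass_date": date})
    return found


# No filter: every partition is a candidate.
all_parts = discover_partitions(tree)
# Case 1: only the new date is considered.
new_date_parts = discover_partitions(tree, {"grass_date": "2022-12-30"})
# Case 2: only the dropped region is considered.
dropped_region_parts = discover_partitions(tree, {"grass_region": "region_X"})
```

   With 3 regions and 4 dates, the unfiltered walk yields 12 candidates, the 
date filter yields 3, and the region filter yields 4. On a real table with a 
large number of existing partitions the gap would be correspondingly larger.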


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

