sunchao commented on PR #43629:
URL: https://github.com/apache/spark/pull/43629#issuecomment-1790038555

   > AFAIK, users are used to using REPAIR TABLE to update partition statistics.
   
   Hmm, sorry, I wasn't aware that this is a common pattern among Spark users. 
However, `REPAIR TABLE` seems to be a bit more expensive than `ANALYZE TABLE`, 
since it first needs to list all the partitions under the table directory and 
then process and validate them. In addition, it doesn't seem to be able to 
update the row count for each partition either.
   
   > I think `ANALYZE TABLE` should update the whole statistics instead of 
partition statistics. How to only update the table whole statistics without 
partition statistics if we accepted this PR?
   
   Yeah, that's a valid question. I wonder why users would want to update 
only the table stats and not the partition stats, though: is it because 
updating the latter is significantly more expensive? In the `ANALYZE TABLE .. 
COMPUTE STATISTICS NOSCAN` case, the current implementation already collects 
the size in bytes for each partition, so we would only incur one extra HMS 
call (`alterPartitions`) to update the stats for these partitions.
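   
   To illustrate the cost argument, here is a rough toy model in Python. It is 
not Spark code: `FakeMetastore`, `analyze_noscan`, the `update_partitions` flag 
and the sample partition names and sizes are all hypothetical, and the model 
only assumes that per-partition sizes are already computed on the way to the 
table total.

```python
from dataclasses import dataclass, field


@dataclass
class FakeMetastore:
    """Hypothetical in-memory stand-in for HMS that records write calls."""
    table_size: int = 0
    partition_sizes: dict = field(default_factory=dict)
    calls: list = field(default_factory=list)

    def alter_table_stats(self, total_size):
        self.calls.append("alterTableStats")
        self.table_size = total_size

    def alter_partitions(self, sizes):
        # One batched call covering every partition.
        self.calls.append("alterPartitions")
        self.partition_sizes.update(sizes)


def analyze_noscan(hms, file_sizes_by_partition, update_partitions):
    """Toy NOSCAN pass: sizes come from file metadata, no rows are read."""
    # The per-partition sizes are needed anyway to get the table total.
    per_partition = {
        name: sum(sizes) for name, sizes in file_sizes_by_partition.items()
    }
    hms.alter_table_stats(sum(per_partition.values()))
    if update_partitions:
        # The only extra work: a single additional metastore call.
        hms.alter_partitions(per_partition)
    return per_partition


# Made-up file sizes in bytes, keyed by partition.
files = {"ds=2023-11-01": [100, 250], "ds=2023-11-02": [400]}

table_only = FakeMetastore()
analyze_noscan(table_only, files, update_partitions=False)

with_partitions = FakeMetastore()
analyze_noscan(with_partitions, files, update_partitions=True)

extra_calls = len(with_partitions.calls) - len(table_only.calls)
print(extra_calls)
```

   In this model the two runs differ by exactly one metastore call; whether 
that holds in practice depends on how the real `alterPartitions` call behaves 
for tables with many partitions.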
   
   Alternatively, maybe we can introduce a new syntax `ANALYZE TABLE 
<tableName> PARTITIONS COMPUTE STATISTICS [NOSCAN]` to update both table and 
partition stats? 

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

