sunchao commented on PR #43629: URL: https://github.com/apache/spark/pull/43629#issuecomment-1790038555
> AFAIK, users are used to using REPAIR TABLE to update partition statistics.

Hmm, sorry, I'm not aware that this is a common pattern among Spark users. However, `REPAIR TABLE` seems a bit more expensive than `ANALYZE TABLE`, since it first needs to list all the partitions under the table directory and then process and validate them. In addition, it doesn't seem able to update the row count for each partition either.

> I think `ANALYZE TABLE` should update the whole statistics instead of partition statistics. How to only update the table whole statistics without partition statistics if we accepted this PR?

Yeah, that's a valid question. I wonder why users would want to update only table stats and not partition stats, though: is it because updating the latter is significantly more expensive? In the `ANALYZE TABLE .. COMPUTE STATISTICS NOSCAN` case, the current implementation already collects the size in bytes for each partition, so we only need to incur one extra HMS call (`alterPartitions`) to update the stats for these partitions.

Alternatively, maybe we can introduce a new syntax, `ANALYZE TABLE <tableName> PARTITIONS COMPUTE STATISTICS [NOSCAN]`, to update both table and partition stats?
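
For context, here is a rough sketch of the commands being compared, written as Spark SQL. The table name `sales` and partition column `ds` are made up for illustration. The last statement is only the syntax floated in this comment; it does not exist in Spark today, so it is left commented out.

```sql
-- Table-level stats, size only (no row count, no data scan).
ANALYZE TABLE sales COMPUTE STATISTICS NOSCAN;

-- Stats for one explicitly named partition (existing syntax).
ANALYZE TABLE sales PARTITION (ds = '2023-11-01') COMPUTE STATISTICS NOSCAN;

-- Partition discovery: lists the table directory and syncs partitions to the metastore.
MSCK REPAIR TABLE sales;

-- Hypothetical syntax proposed in this comment (not implemented):
-- ANALYZE TABLE sales PARTITIONS COMPUTE STATISTICS NOSCAN;
```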
