AngersZhuuuu commented on pull request #31179: URL: https://github.com/apache/spark/pull/31179#issuecomment-773151680
> how big is the overhead? I had an impression that auto stats update is very expensive and not many people are using it...

In the original way:
1. We only update `sizeInBytes` and never update `rowCount`, so if we want `rowCount` we have to re-run the ANALYZE command, which is extra overhead.
2. When updating `sizeInBytes`, the original logic has to fetch the status of every file under the target directory. Since Spark workloads often suffer from the small-file problem, this is slow, especially when the user's HDFS is slow.

In the current way:
1. We collect row count information while writing the data and return it to the driver through `BasicWriteJobStatsTracker`. As discussed before in https://github.com/apache/spark/pull/30026#issuecomment-709868109, returning partition info through `BasicWriteJobStatsTracker` was not a concern, so carrying a `PartitionsStats` in this PR should not be a concern either.
2. We just use the metric data from `BasicWriteJobStatsTracker` to update the statistics metadata. There is no other behavior change.

This way should be faster than the original way, since we don't need to fetch every file's status.
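To make the comparison concrete, here is a minimal sketch (not the code from this PR) of how driver-side metrics reported by `BasicWriteJobStatsTracker` could be turned into a catalog statistics update, instead of re-listing every file under the table location. The helper name and its parameters (`writtenBytes`, `writtenRows`) are hypothetical; `CatalogStatistics` and `SessionCatalog.alterTableStats` are existing Spark APIs, and this kind of code would live inside Spark itself (e.g. next to `CommandUtils`), since `SparkSession.sessionState` is package-private.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.CatalogStatistics

object UpdateStatsFromWriteMetrics {
  // Sketch only: assumes the bytes/rows written have already been aggregated on the
  // driver from the write tasks (as BasicWriteJobStatsTracker does for its SQL metrics).
  def updateStats(
      spark: SparkSession,
      table: TableIdentifier,
      writtenBytes: Long,
      writtenRows: Long): Unit = {
    val catalog = spark.sessionState.catalog
    val oldStats = catalog.getTableMetadata(table).stats

    // For a plain INSERT OVERWRITE of a non-partitioned table the new stats can simply
    // replace the old ones; appends or partition-level updates would need to merge with
    // existing values (omitted here).
    val newStats = CatalogStatistics(
      sizeInBytes = BigInt(writtenBytes),
      rowCount = Some(BigInt(writtenRows)))

    if (oldStats.isEmpty || oldStats.get != newStats) {
      catalog.alterTableStats(table, Some(newStats))
    }
  }
}
```

The point of the sketch is that no filesystem listing is needed: both `sizeInBytes` and `rowCount` come from metrics that already travel back to the driver with the write job.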
