AngersZhuuuu commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-773151680


   > how big is the overhead? I had an impression that auto stats update is very expensive and not many people are using it...
   
   In the original approach:
   
   1. We only update sizeInBytes and never update rowCount, so to get a rowCount the user has to re-run the ANALYZE command, which adds overhead.
   2. When updating sizeInBytes, the original logic has to fetch the status of every file under the target directory. Since Spark workloads commonly suffer from the small-file problem, this is slow, especially when the user's HDFS is slow (see the sketch after this list).
   
   In the current approach:
   
   1. We collect row-count information while writing the data and return it to the driver through `BasicWriteJobStatsTracker`. As discussed earlier in https://github.com/apache/spark/pull/30026#issuecomment-709868109, returning partition info through `BasicWriteJobStatsTracker` is not a concern, so carrying a `PartitionsStats` in my PR should not be a concern either.
   2. We simply use the metric data from `BasicWriteJobStatsTracker` to update the statistics metadata; there is no other behavior change. This should be faster than the original approach, since we don't need to fetch every file's status (see the sketch below).
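
   As a rough sketch (not the actual change in this PR), once the write job's stats tracker has reported the bytes and rows written back to the driver, those metrics could be written straight into the catalog entry without listing any files. The package, object, and parameter names below are hypothetical, and `alterTableStats`/`CatalogStatistics` are Spark-internal APIs, so this sketch assumes it compiles inside the `org.apache.spark.sql` package tree.

   ```scala
   package org.apache.spark.sql.hypotheticalstats

   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.catalyst.TableIdentifier
   import org.apache.spark.sql.catalyst.catalog.CatalogStatistics

   object UpdateStatsFromWriteMetrics {
     // `numBytes`/`numRows` are assumed to come from driver-side write metrics,
     // e.g. values aggregated by a BasicWriteJobStatsTracker.
     def updateStats(
         spark: SparkSession,
         table: TableIdentifier,
         numBytes: Long,
         numRows: Long): Unit = {
       val newStats = CatalogStatistics(
         sizeInBytes = BigInt(numBytes),
         rowCount = Some(BigInt(numRows)))
       // Overwrite the table's stats in the session catalog; no HDFS listing involved.
       spark.sessionState.catalog.alterTableStats(table, Some(newStats))
     }
   }
   ```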

