fuwhu opened a new pull request #27129: [SPARK-30427] Add config item for 
limiting partition number when calculating statistics through File System
URL: https://github.com/apache/spark/pull/27129
 
 
   ### What changes were proposed in this pull request?
   Add the config `spark.sql.statistics.fallBackToFs.maxPartitionNumber` and use it 
to control whether to calculate statistics by falling back to the file system.
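
   Assuming the new config is set like any other Spark SQL configuration, a user could cap the fallback as follows (the threshold value of 1000 is purely illustrative; the config name is taken from this PR):

   ```
   # spark-defaults.conf (hypothetical usage sketch)
   spark.sql.statistics.fallBackToFs.maxPartitionNumber  1000
   ```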
   
   ### Why are the changes needed?
   Currently, when Spark needs to calculate statistics (e.g. sizeInBytes) of table 
partitions through the file system (e.g. HDFS), it does not consider the number 
of partitions. If the number of partitions is huge, calculating the statistics 
can take a long time, and the result may not be that useful.
   
   It is therefore reasonable to add a config item that limits the number of 
partitions for which statistics may be calculated through the file system.
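
   The guard described above can be sketched roughly as follows. This is an illustrative model, not Spark's actual internals: the function names, the `scan_fs` callback, and the sentinel default are all assumptions; Spark's conservative default size is `Long.MaxValue`, mirrored here.

   ```python
   # Hypothetical sketch of the partition-count guard this PR proposes.
   # All names are illustrative, not Spark's real API.

   DEFAULT_SIZE_IN_BYTES = 2**63 - 1  # mirrors Spark's Long.MaxValue default

   def size_in_bytes(partitions, max_partition_number, scan_fs):
       """Return the table size, falling back to a file-system scan only
       when the partition count does not exceed the configured limit."""
       if 0 <= max_partition_number < len(partitions):
           # Too many partitions: skip the expensive FS scan and keep
           # the conservative default statistic.
           return DEFAULT_SIZE_IN_BYTES
       # Few enough partitions: sum per-partition sizes from the FS.
       return sum(scan_fs(p) for p in partitions)
   ```

   With a limit of 2, a 3-partition table keeps the default size, while a higher limit triggers the actual scan.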
   
   
   ### Does this PR introduce any user-facing change?
   Yes, the statistics of a logical plan may change, which may affect some Spark 
strategies, e.g. JoinSelection.
   
   
   ### How was this patch tested?
   Added new unit test.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
