[GitHub] [spark] cloud-fan opened a new pull request #25328: [SPARK-28595][SQL] explain should not trigger partition listing

GitBox Thu, 01 Aug 2019 09:50:33 -0700

cloud-fan opened a new pull request #25328: [SPARK-28595][SQL] explain should 
not trigger partition listing
URL: https://github.com/apache/spark/pull/25328
 
 
   ## What changes were proposed in this pull request?
   
   Sometimes explaining a query hangs for a while. What's worse, it hangs 
again every time you explain the same query.
   
   This is caused by `FileSourceScanExec`:
   1. In its `toString`, it reports the number of partitions it reads. 
Computing that number requires querying the Hive metastore.
   2. In its `outputOrdering`, it needs to get all the files. This also 
requires querying the Hive metastore.
   
   This PR fixes the problem as follows:
   1. `toString` no longer reports the number of partitions it reads. That 
number should be reported via SQL metrics instead.
   2. The `outputOrdering` is not very useful. We can only apply it if a) all 
the bucket columns are read and b) there is only one file in each bucket. 
These conditions are really hard to meet, and even when they are met, sorting 
an already sorted file is pretty fast, so avoiding the sort gains little. I 
think it's worth giving up this optimization so that explain doesn't get stuck.
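
The first fix can be illustrated with a small toy model. This is only a sketch of the idea (defer the expensive listing until execution and surface the count as a metric), not the code in this PR. `ToyScan`, `fake_metastore_listing`, and the `numPartitions` metric name are all hypothetical names chosen for illustration and are not Spark APIs.

```python
from functools import cached_property


class ToyScan:
    """Toy stand-in for a file scan node. Hypothetical; not Spark's FileSourceScanExec."""

    def __init__(self, table, list_partitions):
        self.table = table
        # Callback standing in for an expensive metastore call (assumption).
        self._list_partitions = list_partitions
        self.metrics = {}

    @cached_property
    def selected_partitions(self):
        # Evaluated on first access only, then cached on the instance.
        return self._list_partitions(self.table)

    def __str__(self):
        # The plan description deliberately avoids selected_partitions,
        # so printing the plan never triggers the listing.
        return f"ToyScan(table={self.table})"

    def execute(self):
        # The listing happens here, and the count is recorded as a metric
        # instead of being embedded in the plan string.
        partitions = self.selected_partitions
        self.metrics["numPartitions"] = len(partitions)
        return partitions


calls = []


def fake_metastore_listing(table):
    # Records each call so we can observe when the listing really happens.
    calls.append(table)
    return ["dt=2019-07-31", "dt=2019-08-01"]


scan = ToyScan("events", fake_metastore_listing)
plan_text = str(scan)             # the "explain" step: no listing yet
calls_after_explain = len(calls)
scan.execute()                    # the listing happens once, at execution
```

In this sketch, printing the plan any number of times never calls the listing function, while the partition count is still available after execution through `scan.metrics`.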
   
   ## How was this patch tested?
   
   existing tests


