Github user CodingCat commented on the issue:

    https://github.com/apache/spark/pull/19864
  
    @viirya yes, we can get more accurate stats later, however, the first stats 
is also important as it enables the user to pay less for `the first run` which 
writes cache. 
    
    The current implementation always chooses the most expensive plan in the 
first run, e.g. always resort to sortmergejoin instead of broadcastjoin even it 
is possible, CBO is actually disabled for any operator which locates in 
downstream of InMemoryRelation. Additionally, it makes execution plan 
inconsistent even for the same query over the same dataset. Of course, all of 
these issues happen in the first run.
    
    IMHO, we have a chance to make it better, why not? 
    
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to