Github user CodingCat commented on the issue:
https://github.com/apache/spark/pull/19864
@viirya yes, we can get more accurate stats later, however, the first stats
is also important as it enables the user to pay less for `the first run` which
writes cache.
The current implementation always chooses the most expensive plan in the
first run, e.g. always resort to sortmergejoin instead of broadcastjoin even it
is possible, CBO is actually disabled for any operator which locates in
downstream of InMemoryRelation. Additionally, it makes execution plan
inconsistent even for the same query over the same dataset. Of course, all of
these issues happen in the first run.
IMHO, we have a chance to make it better, why not?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]