Github user marmbrus commented on a diff in the pull request:
https://github.com/apache/spark/pull/1238#discussion_r14676669
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
---
@@ -26,6 +26,17 @@ import org.apache.spark.sql.catalyst.trees
abstract class LogicalPlan extends QueryPlan[LogicalPlan] {
  self: Product =>
+  protected class Estimates {
+    lazy val childrenEstimations = children.map(_.estimates)
+    lazy val cardinality: Long = childrenEstimations.map(_.cardinality).sum
+    lazy val sizeInBytes: Long = childrenEstimations.map(_.sizeInBytes).sum
--- End diff ---
A few notes here:
- I think the most important thing about any default values we set is that
they work together correctly. For example, the default size for RDDs should
probably be large enough that relations without real statistics never get
broadcast on the strength of the default alone. For this reason we probably
want one central location where defaults are set (probably in SQLConf); see
the first sketch after this list.
- Since RDDs are immutable, it may be possible to gather simple statistics
the first time an RDD is queried, perhaps using an accumulator. This does not
need to be done in this PR, however; I would just leave a note about it.
- Regarding joins: for inner joins, if one side really is empty then the
join will yield no results, and for outer joins the maximum possible output
cardinality is the cardinality of the non-empty relation. However, zero is a
risky estimate, because if it is wrong it will really throw the rest of the
plan off. For this reason maybe we should treat 1 as the smallest possible
cardinality estimate (see the second sketch below).
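
To illustrate the first point, here is a rough sketch of keeping the two
values in one place so they cannot drift apart. The names
autoBroadcastJoinThreshold and defaultSizeInBytes are placeholders for
illustration, not the settings added by this PR:

    object SQLConf {
      // Size (in bytes) below which a relation may be broadcast in a join.
      val autoBroadcastJoinThreshold: Long = 10L * 1024 * 1024

      // Fallback size for relations with no statistics. Setting it just above
      // the broadcast threshold guarantees that un-analyzed relations are
      // never broadcast on the strength of the default value alone.
      val defaultSizeInBytes: Long = autoBroadcastJoinThreshold + 1
    }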
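
And a rough sketch of the join point. This is a hypothetical helper, not code
from this PR, and the non-empty case is just a naive worst-case bound:

    // Hypothetical cardinality estimate for a join, given the children's estimates.
    def estimatedJoinCardinality(left: Long, right: Long, isOuter: Boolean): Long = {
      val estimate =
        if (left == 0 || right == 0) {
          // An inner join with an empty side yields no rows; an outer join is
          // bounded by the cardinality of the non-empty relation.
          if (isOuter) math.max(left, right) else 0L
        } else {
          // Naive worst case when neither side is empty; a real estimator
          // would use more information than this.
          left * right
        }
      math.max(estimate, 1L)  // never estimate below 1: zero is too easy to get wrong
    }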