Github user marmbrus commented on a diff in the pull request:
https://github.com/apache/spark/pull/1238#discussion_r14676669
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
---
@@ -26,6 +26,17 @@ import org.apache.spark.sql.catalyst.trees
abstract class LogicalPlan extends QueryPlan[LogicalPlan] {
  self: Product =>
+  protected class Estimates {
+    lazy val childrenEstimations = children.map(_.estimates)
+    lazy val cardinality: Long = childrenEstimations.map(_.cardinality).sum
+    lazy val sizeInBytes: Long = childrenEstimations.map(_.sizeInBytes).sum
--- End diff ---
A few notes here:
- I think the most important thing about any default values we set is that
they work together correctly. For example, the default size for RDDs should
probably be large enough that relations without real statistics never get
broadcast on the strength of the default alone. For this reason we probably
want one central location where defaults are set (probably in SQLConf); see
the first sketch after this list.
- Since RDDs are immutable, it may be possible to gather simple statistics
the first time an RDD is queried, perhaps using an accumulator. This does not
need to be done in this PR, however; I would just leave a note about it.
- Regarding joins: for inner joins, if one side really is empty then the
join will yield no results, and for outer joins the maximum possible output
cardinality is the cardinality of the non-empty relation. However, zero is a
risky estimate, because if it is wrong it will really throw the rest of the
plan off. For this reason maybe we should treat 1 as the smallest possible
cardinality estimate (see the second sketch below).
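
To illustrate the first point, here is a rough sketch of keeping the two
values in one place so they cannot drift apart. The names
autoBroadcastJoinThreshold and defaultSizeInBytes are placeholders for
illustration, not the settings added by this PR:

    object SQLConf {
      // Size (in bytes) below which a relation may be broadcast in a join.
      val autoBroadcastJoinThreshold: Long = 10L * 1024 * 1024

      // Fallback size for relations with no statistics. Setting it just above
      // the broadcast threshold guarantees that un-analyzed relations are
      // never broadcast on the strength of the default value alone.
      val defaultSizeInBytes: Long = autoBroadcastJoinThreshold + 1
    }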
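
And a rough sketch of the join point. This is a hypothetical helper, not code
from this PR, and the non-empty case is just a naive worst-case bound:

    // Hypothetical cardinality estimate for a join, given the children's estimates.
    def estimatedJoinCardinality(left: Long, right: Long, isOuter: Boolean): Long = {
      val estimate =
        if (left == 0 || right == 0) {
          // An inner join with an empty side yields no rows; an outer join is
          // bounded by the cardinality of the non-empty relation.
          if (isOuter) math.max(left, right) else 0L
        } else {
          // Naive worst case when neither side is empty; a real estimator
          // would use more information than this.
          left * right
        }
      math.max(estimate, 1L)  // never estimate below 1: zero is too easy to get wrong
    }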