[GitHub] spark pull request: [SPARK-2094][SQL] "Exactly once" semantics for...

liancheng Fri, 13 Jun 2014 00:27:16 -0700

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1071#discussion_r13740605
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDDLike.scala ---
    @@ -48,7 +49,17 @@ private[sql] trait SchemaRDDLike {
        */
       @transient
       @DeveloperApi
    -  lazy val queryExecution = sqlContext.executePlan(logicalPlan)
    +  lazy val queryExecution = sqlContext.executePlan(baseLogicalPlan)
    +
    +  @transient protected[spark] val logicalPlan: LogicalPlan = baseLogicalPlan match {
    +    // For various commands (like DDL) and queries with side effects, we force query optimization to
    +    // happen right away to let these side effects take place eagerly.
    +    case _: Command | _: InsertIntoTable | _: InsertIntoCreatedTable | _: WriteToFile =>
    +      queryExecution.toRdd
    +      SparkLogicalPlan(queryExecution.executedPlan)
    +    case _ =>
    +      baseLogicalPlan
    +  }
    --- End diff --
    
    Realized that many `SchemaRDD` actions other than `collect()`, as well as 
the DSL methods, reuse `logicalPlan` and break the "exactly once" constraint: 
when the logical plan is planned again, a new physical plan node is created for 
the DDL/command statement, so its side effect takes place again.
    
    So I replaced `logicalPlan` with the executed physical plan wrapped in a 
`SparkLogicalPlan`, which prevents the physical plan from being instantiated 
more than once for the same DDL/command statement.
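
To illustrate the idea outside of Spark, here is a minimal toy sketch. It is not the code in this pull request: `PhysicalCommand`, `CommandPlan`, `AlreadyPlanned`, and `plan` are hypothetical stand-ins for a physical command node, a raw logical command, a `SparkLogicalPlan`-style wrapper, and the planner. It also assumes that a physical node caches its own result, which is modeled here with a `lazy val`.

```scala
// Toy model only; every name below is hypothetical, not Spark's API.
object ExactlyOnceSketch {
  // Counts how many times the "command" side effect actually runs.
  var sideEffects = 0

  // Stand-in for a physical plan node whose execution has a side effect.
  final class PhysicalCommand {
    // The side effect runs on first access; the result is cached afterwards.
    lazy val result: Seq[String] = { sideEffects += 1; Seq("ok") }
  }

  sealed trait Logical
  // Raw logical command: planning it always builds a new physical node.
  case object CommandPlan extends Logical
  // Wrapper around an already-built physical node: planning it reuses the node.
  final case class AlreadyPlanned(node: PhysicalCommand) extends Logical

  // Stand-in for the planner.
  def plan(l: Logical): PhysicalCommand = l match {
    case CommandPlan          => new PhysicalCommand
    case AlreadyPlanned(node) => node
  }

  def main(args: Array[String]): Unit = {
    // Planning the raw command twice creates two nodes, so the effect runs twice.
    plan(CommandPlan).result
    plan(CommandPlan).result
    println(s"raw plan, planned twice: $sideEffects side effects")

    sideEffects = 0
    // Plan once, then expose only the wrapper; later planning reuses the node.
    val wrapped = AlreadyPlanned(plan(CommandPlan))
    plan(wrapped).result
    plan(wrapped).result
    println(s"wrapped plan, planned twice: $sideEffects side effect(s)")
  }
}
```

In this model the wrapper plays the role that `SparkLogicalPlan(queryExecution.executedPlan)` plays in the diff: every later action goes through the same physical node instead of building a new one.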


