[
https://issues.apache.org/jira/browse/SPARK-19628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15869880#comment-15869880
]
Jork Zijlstra edited comment on SPARK-19628 at 2/16/17 1:07 PM:
----------------------------------------------------------------
I have just attached a screenshot showing the duplicate jobs that appear when
executing the example code given above.
The example code uses show(), but in our application we use collect(); both
seem to trigger the duplication.
The issue is that both jobs take time (they are executed sequentially), so the
execution time has doubled for the same action.
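One way to confirm the duplication empirically, rather than reading it off the UI screenshots, is to count job submissions with a SparkListener. A minimal sketch, assuming a local SparkSession; the object name and ORC path are illustrative, not from the report:

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
import org.apache.spark.sql.SparkSession

object CountJobs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[4]")
      .appName("count jobs")
      .getOrCreate()

    // Count every job submitted on this SparkContext.
    val jobCount = new AtomicInteger(0)
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit =
        jobCount.incrementAndGet()
    })

    // Hypothetical ORC path; substitute a real source to reproduce.
    spark.read.orc("/path/to/orc").show(20)

    // On 2.0.1 a single job is expected for the show();
    // on 2.1.0 this report observes two.
    println(s"jobs submitted: ${jobCount.get}")

    spark.stop()
  }
}
```

The same counter works for collect(), since the listener fires on every job submission regardless of which action triggered it.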
was (Author: jzijlstra):
I have just attached a screenshot showing the duplicate jobs that appear when
executing the example code given above.
The example code uses show(), but in our application we use collect(); both
seem to trigger the duplication.
The issue is that both jobs take time, so the execution time has doubled for
the same action.
> Duplicate Spark jobs in 2.1.0
> -----------------------------
>
> Key: SPARK-19628
> URL: https://issues.apache.org/jira/browse/SPARK-19628
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: Jork Zijlstra
> Fix For: 2.0.1
>
> Attachments: spark2.0.1.png, spark2.1.0-examplecode.png,
> spark2.1.0.png
>
>
> After upgrading to Spark 2.1.0 we noticed that there are duplicate jobs
> executed. Going back to Spark 2.0.1, they are gone again.
> {code}
> import org.apache.spark.sql._
>
> object DoubleJobs {
>   def main(args: Array[String]) {
>     System.setProperty("hadoop.home.dir", "/tmp")
>
>     val sparkSession: SparkSession = SparkSession.builder
>       .master("local[4]")
>       .appName("spark session example")
>       .config("spark.driver.maxResultSize", "6G")
>       .config("spark.sql.orc.filterPushdown", true)
>       .config("spark.sql.hive.metastorePartitionPruning", true)
>       .getOrCreate()
>
>     sparkSession.sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
>
>     val paths = Seq(
>       "" //some orc source
>     )
>
>     def dataFrame(path: String): DataFrame = {
>       sparkSession.read.orc(path)
>     }
>
>     paths.foreach(path => {
>       dataFrame(path).show(20)
>     })
>   }
> }
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)