Sreelal S L created SPARK-17621:
-----------------------------------

             Summary: Accumulator value is doubled when using DataFrame.orderBy()
                 Key: SPARK-17621
                 URL: https://issues.apache.org/jira/browse/SPARK-17621
             Project: Spark
          Issue Type: Bug
          Components: Scheduler, SQL
    Affects Versions: 2.0.0
         Environment: Development environment (Eclipse, single process)
            Reporter: Sreelal S L
            Priority: Minor


We track the records read from our source using an accumulator. We do an orderBy on the DataFrame before the output operation. When the job completes, the accumulator value is double the expected value. Below is the sample code I ran.

{code}
import org.apache.spark.sql.SparkSession

val sqlContext = SparkSession.builder()
  .config("spark.sql.retainGroupColumns", false)
  .config("spark.sql.warehouse.dir", "file:///C:/Test")
  .master("local[*]")
  .getOrCreate()
val sc = sqlContext.sparkContext
val accumulator1 = sc.accumulator(0, "accumulator1")

// users.json contains a single row: {"name":"sreelal","country":"IND"}
val usersDF = sqlContext.read.json("C:\\users.json")
// Increment the accumulator once for every record read from the source.
val usersDFwithCount = usersDF.rdd.map { x => accumulator1 += 1; x }
val counterDF = sqlContext.createDataFrame(usersDFwithCount, usersDF.schema)
val orderedDF = counterDF.orderBy("name")

val collected = orderedDF.collect()
collected.foreach(println)

println("accumulator1 : " + accumulator1.value)
println("Done")
{code}

I have only one row in the users.json file, so I expect accumulator1 to have the value 1, but it comes back as 2.

In the Spark SQL UI, I see two jobs generated for the same query.
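
My guess is that the second job is the sampling pass that orderBy runs to compute range-partition boundaries, which re-evaluates the upstream lineage and therefore re-runs the map that increments the accumulator. Below is a minimal workaround sketch under that assumption; it reuses the names from the snippet above, introduces a fresh accumulator2 for illustration, and is not verified as a fix.

{code}
// Workaround sketch (assumption: the extra job re-computes the lineage).
// Uses a fresh accumulator and caches the counting RDD so the first job
// materializes the partitions and any later job reads them from the cache,
// incrementing the accumulator only once per record.
val accumulator2 = sc.accumulator(0, "accumulator2")
val countedRdd = usersDF.rdd.map { x => accumulator2 += 1; x }.cache()

val cachedCounterDF = sqlContext.createDataFrame(countedRdd, usersDF.schema)
cachedCounterDF.orderBy("name").collect()

println("accumulator2 : " + accumulator2.value) // expected: 1
{code}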




