[ 
https://issues.apache.org/jira/browse/SPARK-17621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15509213#comment-15509213
 ] 

Sreelal S L commented on SPARK-17621:
-------------------------------------

Hi Sean, 
Thanks for your quick reply. 
I didnt understand what you meant by "evaluating usersDFwithCount twice". Does 
creating a DataFrame from existing RDD fires a extra job. 

One more catch is i am observing this only for orderBy() . 
If i try a groupBy , ie :  counterDF.groupBy("name").count().collect()  the 
accumulator value is proper. 
In groupBy case also i create DataFrame from the rdd. 

What could be the difference here. 

 


> Accumulator value is doubled when using DataFrame.orderBy()
> -----------------------------------------------------------
>
>                 Key: SPARK-17621
>                 URL: https://issues.apache.org/jira/browse/SPARK-17621
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, SQL
>    Affects Versions: 2.0.0
>         Environment: Development environment. (Eclipse . Single process) 
>            Reporter: Sreelal S L
>            Priority: Minor
>
> We are tracing the records read by our source using an accumulator.  We do a 
> orderBy on the Dataframe before the output operation. When the job is 
> completed, the accumulator values is becoming double of the expected value . 
> . 
> Below is the sample code i ran . 
> {code} 
>  val sqlContext = SparkSession.builder() 
>       .config("spark.sql.retainGroupColumns", 
> false).config("spark.sql.warehouse.dir", "file:///C:/Test").master("local[*]")
>       .getOrCreate()
>     val sc = sqlContext.sparkContext
>     val accumulator1 = sc.accumulator(0, "accumulator1")
>     val usersDF = sqlContext.read.json("C:\\users.json") //  single row 
> {"name":"sreelal" ,"country":"IND"}
>     val usersDFwithCount = usersDF.rdd.map(x => { accumulator1 += 1; x });
>     val counterDF = sqlContext.createDataFrame(usersDFwithCount, 
> usersDF.schema);
>     val oderedDF = counterDF.orderBy("name")
>     val collected = oderedDF.collect()
>     collected.foreach { x => println(x) }
>     println("accumulator1 : " + accumulator1.value)
>     println("Done");
> {code}
> I have only one row in the users.json file.  I expect accumulator1 to have 
> value 1. But its coming as 2. 
> In the Spark Sql UI , i see two jobs getting generated for the same. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to