[ https://issues.apache.org/jira/browse/SPARK-15000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15283379#comment-15283379 ]
Vijay Parmar commented on SPARK-15000:
--------------------------------------
I ran the example code you provided on Spark 1.6.1, and below is the output:
scala> df2.cache
res8: df2.type = [_1: bigint, _2: bigint]
scala> df2.show
+---+---+
| _1| _2|
+---+---+
| 0|467|
| 1|315|
| 2|436|
| 3|193|
| 4|162|
| 5|495|
| 6|397|
| 7|223|
| 8|245|
| 9| 71|
| 10| 3|
| 11|464|
| 12|222|
| 13|471|
| 14|379|
| 15| 22|
| 16|176|
| 17| 79|
| 18| 82|
| 19|230|
+---+---+
only showing top 20 rows
scala> val groupresult=df2.groupBy("_2").agg(count("_1") as "count")
groupresult: org.apache.spark.sql.DataFrame = [_2: bigint, count: bigint]
scala> groupresult.show
+---+-----+
| _2|count|
+---+-----+
| 31| 1|
|231| 2|
|432| 2|
|232| 1|
| 33| 2|
|234| 1|
|434| 2|
| 34| 1|
|435| 3|
| 35| 2|
|436| 2|
|236| 1|
| 37| 1|
|237| 1|
|239| 1|
|439| 1|
| 40| 1|
|240| 2|
|440| 2|
| 41| 1|
+---+-----+
only showing top 20 rows
scala>
It seems to be an issue with the older Spark versions (1.5.2 and 1.6.0, per the
report), not with the current version, i.e. 1.6.1.
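For anyone re-running this, it is worth first confirming which Spark version the
shell is actually bound to, since the behaviour appears to be version-dependent.
A minimal check (sc is predefined in the 1.x spark-shell, and sc.version returns
the version string):

scala> sc.version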
> Spark hangs indefinitely if you cache a dataframe, then show it, then do some
> further processing on it
> ------------------------------------------------------------------------------------------------------
>
> Key: SPARK-15000
> URL: https://issues.apache.org/jira/browse/SPARK-15000
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.2, 1.6.0
> Environment: I am running the test code on both a hortonworks sandbox
> and also on AWS EMR / EC2. Issue occurs in both spark-submit and spark-shell
> Reporter: Jamie Hutton
>
> There seems to be an issue with certain combinations of cache and show when
> using spark. If you read a parquet file from disk, cache it, then perform a
> show operation, the system will hang (forever) if you perform further
> processing on it.
> The following code replicates the issue. I have run it on multiple
> environments, two spark versions and in both spark-shell and spark-submit.
> /* Create a dataframe for our test - I did this so the test was self-contained,
> but you can use any parquet-format dataframe. */
> val r = scala.util.Random
> val list = (0L to 500L).map(i=>(i,r.nextInt(500).asInstanceOf[Long]))
> val distData = sc.parallelize(list)
> import sqlContext.implicits._
> val df=distData.toDF
> df.write.format("parquet").mode("overwrite").save("df_hanging_test.parquet")
> /* Now read the dataframe back in - this is where the test begins */
> val df2 = sqlContext.read.load("df_hanging_test.parquet")
> df2.cache
> df2.show
> val groupresult=df2.groupBy("_2").agg(count("_1") as "count")
> groupresult.show
> /* The last step hangs forever */
> If you remove either the df2.cache or the df2.show lines the issue goes away
> (workaround variants are sketched below). Also the groupBy/agg doesn't seem to
> be the issue - I believe I have seen the same issue with other types of
> processing.
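Based on the note above that removing either the df2.cache or the df2.show line
makes the hang go away, here is a sketch of the two workaround variants
(untested here; it assumes the same 1.5/1.6 spark-shell session and the parquet
file written by the repro code; dfA and dfB are illustrative names):

/* Explicit import, in case the shell session does not already provide it */
import org.apache.spark.sql.functions.count

/* Variant A: keep the cache, skip the intermediate show */
val dfA = sqlContext.read.load("df_hanging_test.parquet")
dfA.cache
dfA.groupBy("_2").agg(count("_1") as "count").show

/* Variant B: keep the show, skip the cache */
val dfB = sqlContext.read.load("df_hanging_test.parquet")
dfB.show
dfB.groupBy("_2").agg(count("_1") as "count").show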