[ https://issues.apache.org/jira/browse/SPARK-15000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15283379#comment-15283379 ]
Vijay Parmar commented on SPARK-15000:
--------------------------------------
I ran the example code you provided on Spark 1.6.1, and below is the output:
scala> df2.cache
res8: df2.type = [_1: bigint, _2: bigint]
scala> df2.show
+---+---+
| _1| _2|
+---+---+
| 0|467|
| 1|315|
| 2|436|
| 3|193|
| 4|162|
| 5|495|
| 6|397|
| 7|223|
| 8|245|
| 9| 71|
| 10| 3|
| 11|464|
| 12|222|
| 13|471|
| 14|379|
| 15| 22|
| 16|176|
| 17| 79|
| 18| 82|
| 19|230|
+---+---+
only showing top 20 rows
scala> val groupresult=df2.groupBy("_2").agg(count("_1") as "count")
groupresult: org.apache.spark.sql.DataFrame = [_2: bigint, count: bigint]
scala> groupresult.show
+---+-----+
| _2|count|
+---+-----+
| 31| 1|
|231| 2|
|432| 2|
|232| 1|
| 33| 2|
|234| 1|
|434| 2|
| 34| 1|
|435| 3|
| 35| 2|
|436| 2|
|236| 1|
| 37| 1|
|237| 1|
|239| 1|
|439| 1|
| 40| 1|
|240| 2|
|440| 2|
| 41| 1|
+---+-----+
only showing top 20 rows
scala>
It seems to be an issue with the older Spark versions (1.5.2 and 1.6.0, per the
report), not with the current version, i.e. 1.6.1.
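For anyone re-running this, it is worth first confirming which Spark version the
shell is actually bound to, since the behaviour appears to be version-dependent.
A minimal check (sc is predefined in the 1.x spark-shell, and sc.version returns
the version string):

scala> sc.version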
> Spark hangs indefinitely if you cache a dataframe, then show it, then do some
> further processing on it
> ------------------------------------------------------------------------------------------------------
>
> Key: SPARK-15000
> URL: https://issues.apache.org/jira/browse/SPARK-15000
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.2, 1.6.0
> Environment: I am running the test code on both a hortonworks sandbox
> and also on AWS EMR / EC2. Issue occurs in both spark-submit and spark-shell
> Reporter: Jamie Hutton
>
> There seems to be an issue with certain combinations of cache and show when
> using spark. If you read a parquet file from disk, cache it, then perform a
> show operation, the system will hang (forever) if you perform further
> processing on it.
> The following code replicates the issue. I have run it on multiple
> environments, two spark versions and in both spark-shell and spark-submit.
> /* Create a dataframe for our test - I did this so the test was self-contained,
> but you can use any parquet-format dataframe. */
> val r = scala.util.Random
> val list = (0L to 500L).map(i=>(i,r.nextInt(500).asInstanceOf[Long]))
> val distData = sc.parallelize(list)
> import sqlContext.implicits._
> val df=distData.toDF
> df.write.format("parquet").mode("overwrite").save("df_hanging_test.parquet")
> /* Now read the dataframe back in - this is where the test begins */
> val df2 = sqlContext.read.load("df_hanging_test.parquet")
> df2.cache
> df2.show
> val groupresult=df2.groupBy("_2").agg(count("_1") as "count")
> groupresult.show
> /* The last step hangs forever */
> If you remove either the df2.cache or the df2.show lines the issue goes away
> (workaround variants are sketched below). Also the groupBy/agg doesn't seem to
> be the issue - I believe I have seen the same issue with other types of
> processing.
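Based on the note above that removing either the df2.cache or the df2.show line
makes the hang go away, here is a sketch of the two workaround variants
(untested here; it assumes the same 1.5/1.6 spark-shell session and the parquet
file written by the repro code; dfA and dfB are illustrative names):

/* Explicit import, in case the shell session does not already provide it */
import org.apache.spark.sql.functions.count

/* Variant A: keep the cache, skip the intermediate show */
val dfA = sqlContext.read.load("df_hanging_test.parquet")
dfA.cache
dfA.groupBy("_2").agg(count("_1") as "count").show

/* Variant B: keep the show, skip the cache */
val dfB = sqlContext.read.load("df_hanging_test.parquet")
dfB.show
dfB.groupBy("_2").agg(count("_1") as "count").show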