Jamie Hutton created SPARK-15000:
------------------------------------

             Summary: Spark hangs indefinitely if you cache a dataframe, then 
show it, then do some further processing on it
                 Key: SPARK-15000
                 URL: https://issues.apache.org/jira/browse/SPARK-15000
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.6.0, 1.5.2
         Environment: I am running the test code on both a hortonworks sandbox 
and also on AWS EMR / EC2. Issue occurs in both spark-submit and spark-shell
            Reporter: Jamie Hutton


There appears to be an issue with certain combinations of cache and show when 
using Spark. If you read a parquet file from disk, cache the resulting 
DataFrame, then perform a show operation, Spark will hang (forever) when you 
perform further processing on it. 

The following code reproduces the issue. I have run it on multiple 
environments, on two Spark versions, and in both spark-shell and spark-submit. 

/*create a DataFrame for our test - I did this so the test is self-contained, 
but you can use any parquet-format DataFrame*/

val r = scala.util.Random
val list = (0L to 500L).map(i => (i, r.nextInt(500).toLong))
val distData = sc.parallelize(list)
import sqlContext.implicits._
val df = distData.toDF
df.write.format("parquet").mode("overwrite").save("df_hanging_test.parquet") 


/*Now read the DataFrame back in - this is where the test begins*/
import org.apache.spark.sql.functions.count // needed under spark-submit; spark-shell imports this automatically
val df2 = sqlContext.read.load("df_hanging_test.parquet")
df2.cache
df2.show
val groupresult = df2.groupBy("_2").agg(count("_1") as "count")
groupresult.show
/*the last step hangs forever*/

If you remove either the df2.cache or the df2.show line, the issue goes away. 
The groupBy/agg does not appear to be the cause either - I believe I have seen 
the same issue with other types of processing.
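Since removing either the cache or the pre-aggregation show avoids the hang, a 
possible workaround sketch is to defer any show on the cached DataFrame until 
after the downstream processing has run. This is speculative and untested 
against the affected versions, not a confirmed fix; it assumes the same sc and 
sqlContext and the df_hanging_test.parquet file from the repro above:

```scala
// Hypothetical workaround sketch: identical pipeline, but no show is called
// on the freshly cached DataFrame before further processing. Requires a
// Spark 1.5/1.6 shell (sc and sqlContext in scope), as in the repro above.
import org.apache.spark.sql.functions.count

val df2 = sqlContext.read.load("df_hanging_test.parquet")
df2.cache // cache as before, but defer any row-printing action on df2
val groupresult = df2.groupBy("_2").agg(count("_1") as "count")
groupresult.show // run the aggregation first
df2.show // whether showing the cached DataFrame afterwards still hangs is unverified
```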



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
