Thierry Herrmann created SPARK-2607:
---------------------------------------

             Summary: SchemaRDD unionAll prevents caching
                 Key: SPARK-2607
                 URL: https://issues.apache.org/jira/browse/SPARK-2607
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.0.0
         Environment: Linux vb2 3.13.0-30-generic #54-Ubuntu SMP Mon Jun 9 22:45:01 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

            Reporter: Thierry Herrmann


This driver program submitted with spark-submit:

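{code:title=TestUnion.scala|borderStyle=solid}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumption: the report does not show how conf is built; spark-submit supplies the master.
val conf = new SparkConf().setAppName("TestUnion")
val sc = new SparkContext(conf)
val sqlCtx = new SQLContext(sc)
// Load one day of the partitioned Parquet table into each SchemaRDD.
val rddForDay1 = sqlCtx.parquetFile("hdfs://dell-715-09/user/hive/warehouse/mytable/uivr_year=2014/uivr_month=5/uivr_day=1")
val rddForDay2 = sqlCtx.parquetFile("hdfs://dell-715-09/user/hive/warehouse/mytable/uivr_year=2014/uivr_month=5/uivr_day=2")
// Mark both RDDs for caching, then force evaluation with an action.
rddForDay1.cache()
rddForDay2.cache()
rddForDay1.union(rddForDay2).count()
{code}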

generates these lines in the log, thanks to the .cache calls:

{noformat}
14/07/21 11:38:49 INFO BlockManagerInfo: Added rdd_1_0 in memory on dell-715-12.neura-local.com:39169 (size: 689.7 MB, free: 8.0 GB)
14/07/21 11:38:49 INFO BlockManagerInfo: Added rdd_0_0 in memory on dell-715-12.neura-local.com:39169 (size: 744.4 MB, free: 7.2 GB)
{noformat}

If I replace union with unionAll, these lines no longer appear in the log, 
which makes me think the RDDs are no longer cached.
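
One way to check this without relying on the log is to ask the SparkContext which RDDs actually hold cached partitions after the action has run. The following is only a sketch, assuming the same Spark 1.0.x API as the program above; the object name CheckCaching and the app name are hypothetical, and getRDDStorageInfo is marked as a developer API, so its output may differ between versions:

{code:title=CheckCaching.scala|borderStyle=solid}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CheckCaching {
  def main(args: Array[String]): Unit = {
    // App name is an assumption; master and other settings come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("CheckCaching"))
    val sqlCtx = new SQLContext(sc)
    val base = "hdfs://dell-715-09/user/hive/warehouse/mytable/uivr_year=2014/uivr_month=5"
    val rddForDay1 = sqlCtx.parquetFile(s"$base/uivr_day=1")
    val rddForDay2 = sqlCtx.parquetFile(s"$base/uivr_day=2")
    rddForDay1.cache()
    rddForDay2.cache()

    // Run the action that is suspected of bypassing the cache.
    rddForDay1.unionAll(rddForDay2).count()

    // RDDs marked as persistent on this context (marked, not necessarily materialized).
    sc.getPersistentRDDs.foreach { case (id, rdd) =>
      println(s"persistent rdd_$id level=${rdd.getStorageLevel}")
    }
    // Storage actually reported for each RDD after the action.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"rdd_${info.id}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, ${info.memSize} bytes in memory")
    }
    sc.stop()
  }
}
{code}

If unionAll really bypasses the cached RDDs, the second listing should show no cached partitions for either input, while the same check after union should show both.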



--
This message was sent by Atlassian JIRA
(v6.2#6252)
