Thierry Herrmann created SPARK-2607:
---------------------------------------
             Summary: SchemaRDD unionall prevents caching
                 Key: SPARK-2607
                 URL: https://issues.apache.org/jira/browse/SPARK-2607
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.0.0
         Environment: Linux vb2 3.13.0-30-generic #54-Ubuntu SMP Mon Jun 9 22:45:01 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
            Reporter: Thierry Herrmann


This driver program, submitted with spark-submit:
{code:title=TestUnion.scala|borderStyle=solid}
val sc = new org.apache.spark.SparkContext(conf)
val sqlCtx = new SQLContext(sc)
val rddForDay1 = sqlCtx.parquetFile(s"hdfs://dell-715-09/user/hive/warehouse/mytable/uivr_year=2014/uivr_month=5/uivr_day=1")
val rddForDay2 = sqlCtx.parquetFile(s"hdfs://dell-715-09/user/hive/warehouse/mytable/uivr_year=2014/uivr_month=5/uivr_day=2")
rddForDay1.cache
rddForDay2.cache
rddForDay1 union rddForDay2 count
{code}
generates these lines in the log, thanks to the .cache calls:
{noformat}
14/07/21 11:38:49 INFO BlockManagerInfo: Added rdd_1_0 in memory on dell-715-12.neura-local.com:39169 (size: 689.7 MB, free: 8.0 GB)
14/07/21 11:38:49 INFO BlockManagerInfo: Added rdd_0_0 in memory on dell-715-12.neura-local.com:39169 (size: 744.4 MB, free: 7.2 GB)
{noformat}
If I replace union with unionAll, these lines no longer appear in the log, which makes me think the RDDs are no longer cached.
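
The missing log lines are only indirect evidence. One way to check more directly might be to ask the SparkContext, after the count has run, which RDDs it holds cached partitions for. The following is a minimal sketch, assuming that SparkContext.getRDDStorageInfo (a developer API) behaves in 1.0.0 as I expect; reportCachedRdds is a hypothetical helper name, not part of Spark.

```scala
import org.apache.spark.SparkContext

// Hypothetical helper (not a Spark API): prints the storage information
// the context currently reports for each RDD it is tracking.
def reportCachedRdds(sc: SparkContext): Unit = {
  sc.getRDDStorageInfo.foreach { info =>
    println(
      s"RDD ${info.id} (${info.name}): " +
      s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
      s"${info.memSize} bytes in memory")
  }
}
```

Calling reportCachedRdds(sc) once after the union variant and once after the unionAll variant should show whether rddForDay1 and rddForDay2 actually have cached partitions in each case, independent of the log level.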