Thierry Herrmann created SPARK-2607:
---------------------------------------
Summary: SchemaRDD unionall prevents caching
Key: SPARK-2607
URL: https://issues.apache.org/jira/browse/SPARK-2607
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.0.0
Environment: Linux vb2 3.13.0-30-generic #54-Ubuntu SMP Mon Jun 9
22:45:01 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Reporter: Thierry Herrmann
This driver program submitted with spark-submit:
{code:title=TestUnion.scala|borderStyle=solid}
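import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// conf was not shown in the original snippet; a default SparkConf is assumed here
val conf = new SparkConf().setAppName("TestUnion")
val sc = new SparkContext(conf)
val sqlCtx = new SQLContext(sc)
val rddForDay1 = sqlCtx.parquetFile("hdfs://dell-715-09/user/hive/warehouse/mytable/uivr_year=2014/uivr_month=5/uivr_day=1")
val rddForDay2 = sqlCtx.parquetFile("hdfs://dell-715-09/user/hive/warehouse/mytable/uivr_year=2014/uivr_month=5/uivr_day=2")
// Mark both inputs for caching before running any action
rddForDay1.cache()
rddForDay2.cache()
// RDD.union followed by an action, which materializes (and caches) both inputs
rddForDay1.union(rddForDay2).count()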
{code}
generates these lines in the log, thanks to the .cache calls:
{noformat}
14/07/21 11:38:49 INFO BlockManagerInfo: Added rdd_1_0 in memory on dell-715-12.neura-local.com:39169 (size: 689.7 MB, free: 8.0 GB)
14/07/21 11:38:49 INFO BlockManagerInfo: Added rdd_0_0 in memory on dell-715-12.neura-local.com:39169 (size: 744.4 MB, free: 7.2 GB)
{noformat}
If I replace union with unionAll, these lines no longer appear in the log,
which makes me think the RDDs are no longer cached.
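The caching state could also be checked directly rather than inferred from the log. The following is only a minimal sketch against the Spark 1.0 API, not a confirmed diagnosis: the object name CheckUnionAllCaching and the two command-line path arguments are hypothetical placeholders, not part of the original program. It runs the unionAll variant and then prints what the SparkContext reports through getPersistentRDDs (RDDs marked as persistent) and getRDDStorageInfo (storage details per RDD).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical helper, not part of the original report.
object CheckUnionAllCaching {
  def main(args: Array[String]): Unit = {
    // Assumption: the two Parquet directories are passed as arguments.
    require(args.length >= 2, "usage: CheckUnionAllCaching <pathDay1> <pathDay2>")
    val pathDay1 = args(0)
    val pathDay2 = args(1)

    val sc = new SparkContext(new SparkConf().setAppName("CheckUnionAllCaching"))
    val sqlCtx = new SQLContext(sc)

    val rddForDay1 = sqlCtx.parquetFile(pathDay1)
    val rddForDay2 = sqlCtx.parquetFile(pathDay2)
    rddForDay1.cache()
    rddForDay2.cache()

    // Same action as in the report, but with unionAll instead of union.
    rddForDay1.unionAll(rddForDay2).count()

    // RDDs that have been marked as persistent on this context.
    sc.getPersistentRDDs.foreach { case (id, rdd) =>
      println(s"persistent RDD $id: storage level ${rdd.getStorageLevel}")
    }

    // Storage details reported by the context, one entry per RDD.
    sc.getRDDStorageInfo.foreach(info => println(info))

    sc.stop()
  }
}
```

If the two inputs show up as persistent but no cached partitions are reported for them after the count, that would be consistent with the missing BlockManagerInfo lines.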