Reynold Xin resolved SPARK-3693.
--------------------------------
Resolution: Duplicate
> Cached Hadoop RDD always return rows with the same value
> --------------------------------------------------------
>
> Key: SPARK-3693
> URL: https://issues.apache.org/jira/browse/SPARK-3693
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.2.0
> Reporter: Xuefu Zhang
>
> While experimenting with RDD caching, we found that caching a Hadoop RDD
> causes data correctness issues. The following code snippet demonstrates the
> problem:
> {code}
> import org.apache.hadoop.io.BytesWritable;
> import org.apache.hadoop.io.Writable;
> import org.apache.hadoop.io.WritableComparable;
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaPairRDD;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.api.java.function.VoidFunction;
> import scala.Tuple2;
>
> public final class Test {
>   public static void main(String[] args) throws Exception {
>     SparkConf sparkConf = new SparkConf().setAppName("Test");
>     JavaSparkContext ctx = new JavaSparkContext(sparkConf);
>     // ... (jobConf setup elided in the original report)
>     JavaPairRDD<BytesWritable, BytesWritable> input =
>         ctx.hadoopRDD(jobConf, CombineHiveInputClass.class,
>             WritableComparable.class, Writable.class);
>     input = input.cache();
>     input.foreach(new VoidFunction<Tuple2<BytesWritable, BytesWritable>>() {
>       @Override
>       public void call(Tuple2<BytesWritable, BytesWritable> row)
>           throws Exception {
>         if (row._1() != null) {
>           System.out.println("Key: " + row._1());
>         }
>         if (row._2() != null) {
>           System.out.println("Value: " + row._2());
>         }
>       }
>     });
>     ctx.stop();
>   }
> }
> {code}
> In this case, row._2() returns the same value for every row. If we disable
> caching by removing the input.cache() call, the program prints the expected
> rows.
> Further analysis shows that MemoryStore (see
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L236)
> stores references to the (key, value) pairs returned by HadoopRDD.getNext()
> (see
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L220),
> but getNext() always returns the same (key, value) object references; each
> call merely updates the values inside those objects. When there are no more
> records, CombineFileRecordReader fills the (key, value) objects with empty
> strings (no values). Because every pair in MemoryStore.vector refers to the
> same key and value objects, all cached values end up NULL.
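> To make the aliasing concrete, here is a minimal, self-contained sketch
> (plain Java, not Spark or Hadoop reader code; the ReuseDemo class and its
> loop are illustrative assumptions) of what happens when a reader mutates one
> Writable in place while the store keeps only references:
> {code}
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.hadoop.io.BytesWritable;
>
> // Hypothetical demo class, not part of Spark or Hadoop.
> public final class ReuseDemo {
>   public static void main(String[] args) {
>     // One reused value object, analogous to HadoopRDD.getNext().
>     BytesWritable value = new BytesWritable();
>     List<BytesWritable> cached = new ArrayList<BytesWritable>();
>     for (String s : new String[] {"a", "b", "c"}) {
>       byte[] bytes = s.getBytes();
>       value.set(bytes, 0, bytes.length);  // mutate in place
>       cached.add(value);                  // store only the reference
>     }
>     // Prints "c" three times: every entry aliases the same object.
>     for (BytesWritable w : cached) {
>       System.out.println(new String(w.getBytes(), 0, w.getLength()));
>     }
>   }
> }
> {code}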
> MemoryStore should probably store a copy of each (key, value) pair rather
> than a reference to it.
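> As a caller-side illustration of that suggestion (a sketch only, assuming
> both key and value are BytesWritable; the CopyBeforeCache class and its
> copy/cacheSafely helpers are hypothetical names), one can deep-copy each
> pair before caching:
> {code}
> import java.util.Arrays;
>
> import org.apache.hadoop.io.BytesWritable;
> import org.apache.spark.api.java.JavaPairRDD;
> import org.apache.spark.api.java.function.PairFunction;
> import scala.Tuple2;
>
> // Hypothetical helper: deep-copy each reused (key, value) pair before caching.
> public final class CopyBeforeCache {
>   // Copies the backing bytes so the cached object no longer aliases the
>   // Writable instance reused by the record reader.
>   private static BytesWritable copy(BytesWritable w) {
>     if (w == null) {
>       return null;
>     }
>     return new BytesWritable(Arrays.copyOf(w.getBytes(), w.getLength()));
>   }
>
>   public static JavaPairRDD<BytesWritable, BytesWritable> cacheSafely(
>       JavaPairRDD<BytesWritable, BytesWritable> input) {
>     return input.mapToPair(
>         new PairFunction<Tuple2<BytesWritable, BytesWritable>,
>             BytesWritable, BytesWritable>() {
>           @Override
>           public Tuple2<BytesWritable, BytesWritable> call(
>               Tuple2<BytesWritable, BytesWritable> row) {
>             return new Tuple2<BytesWritable, BytesWritable>(
>                 copy(row._1()), copy(row._2()));
>           }
>         }).cache();
>   }
> }
> {code}
> Replacing input.cache() in the reproduction above with
> CopyBeforeCache.cacheSafely(input) should print distinct rows, at the cost
> of one extra copy per record.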