Hi,

I have run into a weird caching problem (using Spark 1.3.1 and Java 1.8.0) that
I can only explain as a bug.

In summary, I source the RDD from an Avro file, apply a mapToPair function,
then cache and count. However, the RDD is not cached, nor does it appear in the
Spark UI Storage tab. (It is not cached at all, not even partially.)
                JavaSparkContext ctx = …;
                JavaRDD a = …;
                JavaPairRDD b = a.mapToPair(..).cache();
                b.count(); // RDD is not cached.

I looked around but could not find any known bugs around this.

I debugged the b RDD and it is set as cached:
(80) MapPartitionsRDD[31] at mapToPair at ABC.java:684 [Memory Deserialized 1x Replicated]
|   RDD1 MapPartitionsRDD[22] at map at XXXAvroDao.java:xx [Memory Deserialized 1x Replicated]
|   MapPartitionsRDD[21] at keys at XXXAvroDao.java:xx [Memory Deserialized 1x Replicated]
|   maprfs:/mapr/XXX NewHadoopRDD[20] at newAPIHadoopFile at XXXAvroDao.java:xx [Memory Deserialized 1x Replicated]

I also checked the b RDD storage level using a debugger and it seems correctly 
set as well.
StorageLevel(false, true, false, true, 1)
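For reference, my reading of those five values: as far as I know, in Spark 1.x
StorageLevel.toString() prints the fields in the order useDisk, useMemory,
useOffHeap, deserialized, replication. Under that assumption, the rough sketch
below labels them. StorageLevelFlags is a hypothetical helper of my own, not
part of Spark, and it only parses the printed string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Spark class): labels the five fields of a
// Spark 1.x StorageLevel string, assuming the field order
// (useDisk, useMemory, useOffHeap, deserialized, replication).
class StorageLevelFlags {
    private static final Pattern LEVEL = Pattern.compile(
        "StorageLevel\\((true|false), (true|false), (true|false), (true|false), (\\d+)\\)");

    static String describe(String text) {
        Matcher m = LEVEL.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognised StorageLevel string: " + text);
        }
        return "useDisk=" + m.group(1)
            + ", useMemory=" + m.group(2)
            + ", useOffHeap=" + m.group(3)
            + ", deserialized=" + m.group(4)
            + ", replication=" + m.group(5);
    }

    public static void main(String[] args) {
        // The value I saw in the debugger for the b RDD.
        System.out.println(describe("StorageLevel(false, true, false, true, 1)"));
    }
}
```

If that field order is right, the level is memory only, deserialized, one
replica, which agrees with the "Memory Deserialized 1x Replicated" shown in the
lineage above and is what I would expect cache() to set.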

Now things get more interesting, as the following does result in a cached RDD:
                a.cache().count();

Also the following works:
                ctx.parallelize(b.take(1000)).cache().count();

However, any attempt to “fool” b.cache() fails as well (the action completes
but the data is not cached at all). E.g.
                b.repartition(150).cache().count();
                b.values().cache().count();
                b.keys().cache().count();
                b.persist(StorageLevel.DISK_ONLY()).count();
                b.persist(StorageLevel.MEMORY_ONLY()).count();
                b.persist(StorageLevel.MEMORY_ONLY_SER()).count();
                b.unpersist().cache().count();


I haven’t managed to replicate the issue without the exact data, so I cannot
provide a reproducible example: caching works fine with every other data type I
have and with every example I tried.

Any ideas on where I should look?

Thanks.

