I have a strong suspicion that this was caused by a full disk on the executor. I am not sure whether dropping blocks like that is how an executor is supposed to recover from it.
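If you suspect the same thing, a rough way to sample free space under the executors' local dirs from the shell is something like the sketch below (the SPARK_LOCAL_DIRS fallback to /tmp is my assumption, and tasks are not guaranteed to land on every executor, so treat the output as a sample, not a census):

    // Best-effort sketch: sample free disk space where executors spill and shuffle.
    // Assumes SPARK_LOCAL_DIRS is set on the workers; falls back to /tmp.
    val hosts = sc.parallelize(1 to sc.defaultParallelism * 4, sc.defaultParallelism * 4)
      .map { _ =>
        val dir = new java.io.File(sys.env.getOrElse("SPARK_LOCAL_DIRS", "/tmp").split(",")(0))
        (java.net.InetAddress.getLocalHost.getHostName, dir.getUsableSpace / (1024 * 1024))
      }
      .distinct()
      .collect()
    hosts.sortBy(_._2).foreach { case (host, freeMb) => println(s"$host: $freeMb MB free") }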
I cannot be sure about it; I should have had enough disk space, but I think I had some data skew, which could have led some executors to run out of disk. So, in case someone else notices behavior like this, make sure you check your cluster monitor (e.g. Ganglia).

On Wed, Jan 28, 2015 at 5:40 PM, Thomas Gerber <thomas.ger...@radius.com> wrote:

> Hello,
>
> I am storing RDDs with the MEMORY_ONLY_SER storage level during the run
> of a big job.
>
> At some point during the job, I went to the Executors page and saw that
> 80% of my executors no longer had stored RDDs (executors.png). On the
> Storage page, everything seems "there" (storage.png).
>
> But if I look at a given RDD (RDD_83.png), although the top of the page
> says all 100 partitions are cached, the details show that only 17 are
> actually stored (RDD_83_partitions), all on the 20% of executors that
> still had stored RDDs according to the Executors page.
>
> So I wonder:
> 1. Are those RDDs still cached (in which case we have a small reporting
> error), or not?
> 2. If not, what could cause an executor to drop its memory-stored RDD
> blocks?
>
> I guess a restart of an executor? When I compare an executor that seems
> to have dropped blocks vs. one that has not:
> *** their
> *spark-hadoop-org.apache.spark.deploy.worker.Worker-1-ip-XX-XX-XX-XX.ec2.internal.out*
> contents look the same
> *** they both have the same etime in ps (so, I guess, no restart?)
> *** I didn't see anything in the app log in the work folder (but it is
> large, so I might have missed it)
>
> Also, I should mention that the cluster was doing a lot of GC, which
> might be a cause of the trouble.
>
> I would appreciate any pointers.
> Thomas
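For anyone who wants to cross-check what the Storage page reports, and to look for the kind of skew I suspected, here are two rough sketches against the Scala API (rdd stands in for whichever RDD you persisted; getRDDStorageInfo is a developer API, so its shape may change between versions):

    // 1. Print the actual cached fraction per RDD, instead of trusting the UI.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"RDD ${info.id} '${info.name}': " +
        s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
        s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
    }

    // 2. Count records per partition; a few huge partitions would support
    // the skew/disk-full theory. Note this triggers a job over the whole RDD.
    val perPartition = rdd
      .mapPartitionsWithIndex { (i, it) => Iterator((i, it.size)) }
      .collect()
      .sortBy(-_._2)
    perPartition.take(10).foreach { case (i, n) => println(s"partition $i: $n records") }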