I have a strong suspicion that it was caused by a full disk on the executor.
I am not sure whether the executor was supposed to recover from it that way.

I cannot be sure about it; I should have had enough disk space overall, but I
think I had some data skew, which could have led some executors to run out of
disk.
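
In case it helps, here is a rough sketch (Scala; the input path, the tab-keyed
parsing and the `SkewCheck` name are illustrative assumptions, not taken from
the actual job) of how one might check for that kind of key skew:

    import org.apache.spark.{SparkConf, SparkContext}

    object SkewCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("skew-check"))

        // Illustrative pair RDD; in the real job this would be whatever
        // keyed dataset feeds the heavy shuffle.
        val pairs = sc.textFile("hdfs:///path/to/input")
          .map(line => (line.split('\t')(0), line))

        // Count records per key and print the heaviest keys; a few keys
        // carrying most of the records is a good hint of skew.
        pairs
          .mapValues(_ => 1L)
          .reduceByKey(_ + _)
          .map(_.swap)
          .sortByKey(ascending = false)
          .take(20)
          .foreach(println)

        sc.stop()
      }
    }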

So, in case someone else notices behavior like this, make sure you check
your cluster monitor (e.g. Ganglia).
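
For reference, a minimal sketch (Scala again; names and paths are illustrative,
not from the original job) of the kind of MEMORY_ONLY_SER caching setup
described in the quoted message below:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object CacheSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cache-sketch"))

        // Illustrative input and transformation, not from the original job.
        val lines  = sc.textFile("hdfs:///path/to/input")
        val parsed = lines.map(_.split('\t'))

        // MEMORY_ONLY_SER keeps blocks in memory only, in serialized form.
        // Blocks that do not fit, or that get evicted under memory pressure,
        // are recomputed from lineage rather than spilled to disk.
        parsed.persist(StorageLevel.MEMORY_ONLY_SER)

        parsed.count()  // the first action materializes the cached blocks
        sc.stop()
      }
    }

With that storage level, evicted blocks simply disappear from the cache, which
would be consistent with blocks vanishing from the Executors page when
executors come under memory pressure.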

On Wed, Jan 28, 2015 at 5:40 PM, Thomas Gerber <thomas.ger...@radius.com>
wrote:

> Hello,
>
> I am storing RDDs with the MEMORY_ONLY_SER storage level during the run
> of a big job.
>
> At some point during the job, I went to the Executors page and saw that
> 80% of my executors did not have stored RDDs anymore (executors.png). On
> the Storage page, everything seemed "there" (storage.png).
>
> But if I look at a given RDD (RDD_83.png), although it says at the top
> that all 100 partitions are cached, the details show that only 17 are
> actually stored (RDD_83_partitions), all on the 20% of executors that still
> had stored RDDs according to the Executors page.
>
> So I wonder:
> 1. Are those RDDs still cached (in which case we have a small reporting
> error), or not?
> 2. If not, what could cause an executor to drop its memory-stored RDD
> blocks?
>
> I guess a restart of an executor? When I compare an executor that seems to
> have dropped blocks vs. one that has not:
> *** their
> *spark-hadoop-org.apache.spark.deploy.worker.Worker-1-ip-XX-XX-XX-XX.ec2.internal.out*
> contents look the same
> *** they both have the same etime in ps (so I guess no restart?)
> *** I didn't see anything in the app log in the work folder (but it is
> large, so I might have missed it)
>
> Also, I must mention that the cluster was doing a lot of GCs, which might
> be a cause of the trouble.
>
> I would appreciate any pointer.
> Thomas
>
>
