Hi Phil,

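This sounds a lot like a thread-safety bug in Hadoop's Configuration object
that I ran into a while back.  Strictly speaking it's an infinite loop rather
than a deadlock: unsynchronized concurrent writes corrupt the underlying
HashMap and a thread then spins in resize forever, which would match your
100% CPU.  If you jstack the JVM and see a thread that looks like the one
below, it could be https://issues.apache.org/jira/browse/SPARK-2546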

"Executor task launch worker-6" daemon prio=10 tid=0x00007f91f01fe000
nid=0x54b1 runnable [0x00007f92d74f1000]
   java.lang.Thread.State: RUNNABLE
    at java.util.HashMap.transfer(HashMap.java:601)
    at java.util.HashMap.resize(HashMap.java:581)
    at java.util.HashMap.addEntry(HashMap.java:879)
    at java.util.HashMap.put(HashMap.java:505)
    at org.apache.hadoop.conf.Configuration.set(Configuration.java:803)
    at org.apache.hadoop.conf.Configuration.set(Configuration.java:783)
    at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1662)


The fix for this issue is hidden behind a flag (off by default) because
cloning the Configuration for every task might have performance
implications, but if it is this problem then you can set
spark.hadoop.cloneConf=true and see if that fixes things.
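
To illustrate why cloning helps, here is a minimal sketch in plain Java.  It
is not Spark's or Hadoop's actual code: a HashMap stands in for the shared
Configuration, and the class and method names (CloneConfSketch, sharedConf,
runTask, runAll) are made up for the example.  Each task copies the shared
map before writing, so no two threads ever call put() on the same HashMap.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class CloneConfSketch {

    // Stand-in for the shared Hadoop Configuration: a plain, unsynchronized HashMap.
    static Map<String, String> sharedConf() {
        Map<String, String> conf = new HashMap<>();
        conf.put("example.key", "example-value"); // hypothetical setting
        return conf;
    }

    // One task: copy the shared map first, then write only to the private copy.
    static Map<String, String> runTask(Map<String, String> shared, int taskId) {
        Map<String, String> local = new HashMap<>(shared);
        for (int i = 0; i < 1000; i++) {
            // Enough inserts to force several resizes of the private map.
            local.put("task." + taskId + ".key." + i, "v" + i);
        }
        return local;
    }

    // Runs the tasks on a small pool.
    // Returns {entries in the shared map, total entries across all private copies}.
    static int[] runAll(int tasks) {
        final Map<String, String> shared = sharedConf();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Map<String, String>>> futures = new ArrayList<>();
            for (int t = 0; t < tasks; t++) {
                final int id = t;
                Callable<Map<String, String>> task = () -> runTask(shared, id);
                futures.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Map<String, String>> f : futures) {
                total += f.get().size();
            }
            return new int[] {shared.size(), total};
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] result = runAll(8);
        System.out.println("shared entries: " + result[0]
                + ", entries across copies: " + result[1]);
    }
}
```

The shared map is only ever read, so it stays at its original size no matter
how many tasks run.  To try the real flag, you can pass
--conf spark.hadoop.cloneConf=true to spark-submit or set the same key on
your SparkConf.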

Good luck!
Andrew

On Tue, Dec 23, 2014 at 9:40 AM, Phil Wills <otherp...@gmail.com> wrote:

> I've been attempting to run a job based on MLlib's ALS implementation for
> a while now and have hit an issue I'm having a lot of difficulty getting to
> the bottom of.
>
> On a moderate size set of input data it works fine, but against larger
> (still well short of what I'd think of as big) sets of data, I'll see one
> or two workers get stuck spinning at 100% CPU and the job unable to
> recover.
>
> I don't believe this is down to memory pressure as I seem to get the same
> behaviour at about the same size of input data, even if the cluster is
> twice as large. GC logs also suggest things are proceeding reasonably with
> some Full GCs occurring, but no suggestion of the process being GC locked.
>
> After rebooting the instance that got into trouble, I can see the stderr
> log for the task truncated in the middle of a log-line at the time CPU
> shoots to and sticks at 100%, but no other signs of a problem.
>
> I've run into the same issue on 1.1.0 and 1.2.0 in standalone mode and
> running on YARN.
>
> Any suggestions on further steps I could try to get a clearer diagnosis of
> the issue would be much appreciated.
>
> Thanks,
>
> Phil
>
